📄 Abstract
Immersive applications call for synthesizing spatiotemporal 4D content from
casual videos without costly 3D supervision. Existing video-to-4D methods
typically rely on manually annotated camera poses, which are labor-intensive
and brittle for in-the-wild footage. Recent warp-then-inpaint approaches
mitigate the need for pose labels by warping input frames along a novel camera
trajectory and using an inpainting model to fill missing regions, thereby
depicting the 4D scene from diverse viewpoints. However, this
trajectory-to-trajectory formulation often entangles camera motion with scene
dynamics and complicates both modeling and inference. We introduce SEE4D, a
pose-free, trajectory-to-camera framework that replaces explicit trajectory
prediction with rendering to a bank of fixed virtual cameras, thereby
separating camera control from scene modeling. A view-conditional video
inpainting model is trained to learn a robust geometry prior by denoising
realistically synthesized warped images and to inpaint occluded or missing
regions across virtual viewpoints, eliminating the need for explicit 3D
annotations. Building on this inpainting core, we design a spatiotemporal
autoregressive inference pipeline that traverses virtual-camera splines and
extends videos with overlapping windows, enabling coherent generation at
bounded per-step complexity. We validate SEE4D on cross-view video generation
and sparse reconstruction benchmarks. Across quantitative metrics and
qualitative assessments, our method achieves superior generalization and
improved performance relative to pose- or trajectory-conditioned baselines,
advancing practical 4D world modeling from casual videos.
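To make the inference pipeline concrete, the sketch below illustrates an overlapping-window generation loop over a fixed bank of virtual cameras, as described in the abstract. It is a minimal illustration under assumed interfaces: the function names, window sizes, and the `warp_fn`/`inpaint_fn` callables are hypothetical placeholders, not SEE4D's actual API.

```python
def sliding_window_generation(frames, virtual_cameras, warp_fn, inpaint_fn,
                              window=8, overlap=2):
    """Hypothetical sketch of warp-then-inpaint inference over fixed virtual cameras.

    frames          : list of input video frames (e.g. HxWx3 arrays)
    virtual_cameras : fixed camera poses, e.g. sampled along a spline
    warp_fn(frame, cam)     -> frame warped into `cam` (holes where occluded)
    inpaint_fn(frames, cam) -> frames with holes filled by the view-conditional model
    """
    assert 0 < overlap < window
    outputs = {i: [] for i in range(len(virtual_cameras))}
    start = 0
    while start < len(frames):
        clip = frames[start:start + window]
        for i, cam in enumerate(virtual_cameras):
            # 1) Warp the current temporal window into the fixed virtual camera.
            warped = [warp_fn(f, cam) for f in clip]
            # 2) Fill occluded or missing regions for this viewpoint.
            completed = inpaint_fn(warped, cam)
            # Drop the overlapping prefix after the first window to avoid duplicates.
            outputs[i].extend(completed if start == 0 else completed[overlap:])
        # 3) Slide forward with overlap, keeping per-step complexity bounded.
        start += window - overlap
    return outputs
```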
Authors (11)
Dongyue Lu
Ao Liang
Tianxin Huang
Xiao Fu
Yuyang Zhao
Baorui Ma
+5 more
Submitted
October 30, 2025
Key Contributions
SEE4D introduces a pose-free framework for 4D content generation from casual videos, eliminating the need for manual camera pose annotations. It replaces explicit trajectory prediction with rendering to fixed virtual cameras, separating camera control from scene modeling. This approach simplifies the generation process and improves robustness for in-the-wild footage.
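The "rendering to fixed virtual cameras" step can be pictured as depth-based forward warping followed by inpainting of the exposed holes. The sketch below covers only the warping half under simple pinhole assumptions (nearest-neighbor splatting, no z-buffering); the names and signatures are illustrative, not the paper's implementation.

```python
import numpy as np

def warp_to_virtual_camera(image, depth, K, rel_pose):
    """Forward-warp an RGB frame into a fixed virtual camera using per-pixel depth.

    image    : (H, W, 3) source frame
    depth    : (H, W) depth map aligned with `image`
    K        : (3, 3) shared pinhole intrinsics
    rel_pose : (4, 4) rigid transform from the source to the virtual camera
    Returns the warped image and a validity mask; pixels where the mask is False
    are the regions a view-conditional inpainting model would need to fill.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project every pixel to 3D in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])

    # Transform points into the virtual camera and re-project.
    pts_v = (rel_pose @ pts_h)[:3]
    proj = K @ pts_v
    z = np.clip(proj[2], 1e-6, None)
    u = np.round(proj[0] / z).astype(int)
    v = np.round(proj[1] / z).astype(int)

    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (pts_v[2] > 0)
    warped[v[valid], u[valid]] = image.reshape(-1, 3)[valid]
    mask[v[valid], u[valid]] = True
    return warped, mask
```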
Business Value
Enables easier creation of immersive 3D content for VR/AR applications, virtual tours, and telepresence, potentially lowering production costs and barriers to entry. This can enhance user engagement in digital experiences.