
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Abstract

We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.
Authors (11)
Panwang Pan
Chenguo Lin
Jingjing Zhao
Chenxin Li
Yuchen Lin
Haopeng Li
+5 more
Submitted
November 1, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Diff4Splat presents a feed-forward method for synthesizing controllable and explicit 4D scenes from a single image by unifying video diffusion models with 4D data constraints. It directly predicts a deformable 3D Gaussian field encoding appearance, geometry, and motion, eliminating test-time optimization. The core innovation is a video latent transformer that captures spatio-temporal dependencies for predicting time-varying 3D Gaussian primitives, enabling high-quality 4D scene generation.
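To make the "deformable 3D Gaussian field" output concrete, the sketch below shows one plausible way such a prediction could be represented: canonical per-Gaussian parameters (centers, scales, rotations, opacities, colors) plus per-frame displacements that encode motion. This is an illustrative assumption, not the paper's actual data structure or interface; the class and method names (`DeformableGaussianField`, `at_time`) are hypothetical.

```python
import numpy as np

class DeformableGaussianField:
    """Hypothetical container for time-varying 3D Gaussian primitives,
    of the kind a feed-forward model like Diff4Splat might predict."""

    def __init__(self, n_gaussians: int, n_frames: int, rng=None):
        rng = rng or np.random.default_rng(0)
        # Canonical (time-invariant) parameters of each 3D Gaussian primitive.
        self.means = rng.normal(size=(n_gaussians, 3))           # centers
        self.scales = np.abs(rng.normal(size=(n_gaussians, 3)))  # axis scales
        self.rotations = np.tile([1.0, 0.0, 0.0, 0.0],
                                 (n_gaussians, 1))               # unit quaternions
        self.opacities = rng.uniform(size=(n_gaussians, 1))
        self.colors = rng.uniform(size=(n_gaussians, 3))         # RGB (no SH here)
        # Per-frame motion: a displacement for every Gaussian at every frame,
        # encoding the scene's dynamics on top of the canonical geometry.
        self.deltas = 0.01 * rng.normal(size=(n_frames, n_gaussians, 3))

    def at_time(self, t: int) -> np.ndarray:
        """Gaussian centers deformed to frame t (canonical + displacement)."""
        return self.means + self.deltas[t]

# Placeholder random field standing in for a real model prediction.
field = DeformableGaussianField(n_gaussians=1024, n_frames=16)
frame0 = field.at_time(0)
print(frame0.shape)  # (1024, 3)
```

Splitting the representation into canonical parameters plus per-frame deltas keeps appearance and geometry shared across time while motion stays compact, which is why deformable-Gaussian formulations are a common choice for dynamic scenes.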

Business Value

Enables rapid creation of realistic and dynamic 3D environments from single images, accelerating content creation for VR/AR, gaming, and virtual production.