📄 Abstract
Video generation models have progressed tremendously through large latent
diffusion transformers trained with rectified flow techniques. Yet these models
still struggle with geometric inconsistencies, unstable motion, and visual
artifacts that break the illusion of realistic 3D scenes. 3D-consistent video
generation could significantly impact numerous downstream applications in
generation and reconstruction tasks. We explore how epipolar geometry
constraints improve modern video diffusion models. Despite massive training
data, these models fail to capture fundamental geometric principles underlying
visual content. We align diffusion models using pairwise epipolar geometry
constraints via preference-based optimization, directly addressing unstable
camera trajectories and geometric artifacts through mathematically principled
geometric enforcement. Our approach efficiently enforces geometric principles
without requiring end-to-end differentiability. Evaluation demonstrates that
classical geometric constraints provide more stable optimization signals than
modern learned metrics, which produce noisy targets that compromise alignment
quality. Training on static scenes with dynamic cameras ensures high-quality
measurements while the model generalizes effectively to diverse dynamic
content. By bridging data-driven deep learning with classical geometric
computer vision, we present a practical method for generating spatially
consistent videos without compromising visual quality.
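As an illustration of what a pairwise epipolar constraint measures, the sketch below scores point correspondences between two frames against a fundamental matrix using the Sampson error, a standard first-order approximation of epipolar distance. The abstract does not say which error measure, matcher, or estimator the authors use, so the Sampson error and the names `sampson_error` and `pick_preferred` are assumptions made for this example only.

```python
import numpy as np

def sampson_error(F, x1, x2):
    """Mean Sampson epipolar error for N x 2 correspondences x1 <-> x2.

    F is a 3x3 fundamental matrix satisfying p2^T F p1 = 0 for perfect
    matches, where p1 and p2 are the homogeneous versions of x1 and x2.
    """
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])          # N x 3 homogeneous points, frame 1
    p2 = np.hstack([x2, ones])          # N x 3 homogeneous points, frame 2
    Fp1 = p1 @ F.T                      # row i is F @ p1_i (epipolar line in frame 2)
    Ftp2 = p2 @ F                       # row i is F^T @ p2_i (epipolar line in frame 1)
    num = np.sum(p2 * Fp1, axis=1) ** 2 # squared algebraic residual p2^T F p1
    den = Fp1[:, 0] ** 2 + Fp1[:, 1] ** 2 + Ftp2[:, 0] ** 2 + Ftp2[:, 1] ** 2
    return float(np.mean(num / den))

def pick_preferred(error_a, error_b):
    """Hypothetical helper: return (winner, loser) labels, lower error wins."""
    return ("a", "b") if error_a <= error_b else ("b", "a")

# Toy setup (assumed): identity intrinsics and pure translation along x,
# so F is the skew-symmetric matrix of t = (1, 0, 0).
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x1 = np.array([[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6]])
x2_consistent = x1 + np.array([0.05, 0.0])    # points slide along epipolar lines
x2_inconsistent = x1 + np.array([0.05, 1.0])  # points drift off epipolar lines

err_a = sampson_error(F, x1, x2_consistent)
err_b = sampson_error(F, x1, x2_inconsistent)
winner, loser = pick_preferred(err_a, err_b)
```

In a real pipeline the correspondences and the fundamental matrix would come from a feature matcher and a robust estimator applied to generated frames. Here they are written by hand so the example stays self-contained.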
Authors (4)
Orest Kupyn
Fabian Manhardt
Federico Tombari
Christian Rupprecht
Submitted
October 24, 2025
Key Contributions
This paper studies how epipolar geometry constraints can improve modern video diffusion models, targeting geometric inconsistencies and unstable motion. It aligns diffusion models with pairwise epipolar geometry constraints through preference-based optimization, which enforces geometric principles without requiring end-to-end differentiability and aims at more realistic, 3D-consistent video generation.
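To make concrete why preference-based alignment does not need a differentiable geometric metric, here is a minimal sketch of a generic DPO-style pairwise loss for diffusion models. The summary does not specify the authors' objective, so this form, the function name `dpo_style_loss`, and the parameter `beta` are assumptions for illustration. The geometric score only decides which sample is the winner and which is the loser. The loss itself depends on denoising errors of the trained model and a frozen reference model.

```python
import numpy as np

def dpo_style_loss(err_w_policy, err_w_ref, err_l_policy, err_l_ref, beta=1.0):
    """Generic DPO-style loss for one (winner, loser) pair. Illustrative only.

    err_*_policy: denoising error of the model being trained
    err_*_ref:    denoising error of the frozen reference model
    The winner/loser labels come from an external, possibly
    non-differentiable, score such as an epipolar error.
    """
    margin = (err_w_policy - err_w_ref) - (err_l_policy - err_l_ref)
    # -log(sigmoid(-beta * margin)) written in a numerically stable form
    return float(np.logaddexp(0.0, beta * margin))

# When the policy matches the reference, the loss equals log(2).
neutral = dpo_style_loss(1.0, 1.0, 1.0, 1.0)
# Fitting the winner better than the reference lowers the loss.
improved = dpo_style_loss(0.5, 1.0, 1.0, 1.0)
# Fitting the loser better than the reference raises the loss.
worsened = dpo_style_loss(1.0, 1.0, 0.5, 1.0)
```

Gradients would flow only through the policy's denoising errors, so the ranking metric can be any classical measurement computed offline on generated videos.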
Business Value
Enables the creation of more realistic and geometrically sound synthetic videos, benefiting applications in virtual reality, gaming, film, and robotics simulation.