Abstract
Generating 3D scenes remains a challenging task due to the lack of readily
available scene data. Most existing methods produce only partial scenes and
offer limited navigational freedom. We introduce a practical and scalable
solution that uses 360° video as an intermediate scene representation,
capturing full-scene context and ensuring consistent visual content
throughout generation. We propose WorldPrompter, a generative pipeline that
synthesizes traversable 3D scenes from text prompts. WorldPrompter incorporates
a conditional 360° panoramic video generator capable of producing a
128-frame video that simulates a person walking through and capturing a virtual
environment. The resulting video is then reconstructed into Gaussian splats by a
fast feedforward 3D reconstructor, enabling a truly walkable experience within
the 3D scene. Experiments demonstrate that our panoramic video generation
model, trained with a mix of image and video data, achieves convincing spatial
and temporal consistency for static scenes. This is validated by an average
COLMAP matching rate of 94.6%, enabling high-quality panoramic Gaussian
splat reconstruction and improved navigation throughout the scene. Qualitative
and quantitative results also show that it outperforms state-of-the-art
360° video generators and 3D scene generation models.
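
To make the two-stage design concrete, below is a minimal Python sketch of the pipeline the abstract describes: text prompt → 360° walkthrough video → Gaussian splat scene. All names here (generate_panoramic_video, reconstruct_gaussian_splats, PanoramaFrame, GaussianSplatScene) are hypothetical placeholders, not WorldPrompter's actual API; only the stage boundaries and the 128-frame figure come from the abstract.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical data containers; WorldPrompter's real interfaces are not public.

@dataclass
class PanoramaFrame:
    """One 360° equirectangular frame of the intermediate video (assumed format)."""
    pixels: bytes  # placeholder for an HxWx3 image buffer
    frame_index: int

@dataclass
class GaussianSplatScene:
    """A set of 3D Gaussians representing the reconstructed scene."""
    num_gaussians: int

def generate_panoramic_video(prompt: str, num_frames: int = 128) -> List[PanoramaFrame]:
    """Stage 1 (sketch): the conditional 360° video generator, which simulates
    a person walking through and capturing the scene described by `prompt`.
    The abstract states the model produces a 128-frame panoramic video."""
    raise NotImplementedError("stand-in for the panoramic video generator")

def reconstruct_gaussian_splats(frames: List[PanoramaFrame]) -> GaussianSplatScene:
    """Stage 2 (sketch): the fast feedforward reconstructor that lifts the
    panoramic video into Gaussian splats for free navigation."""
    raise NotImplementedError("stand-in for the feedforward 3D reconstructor")

def text_to_traversable_scene(prompt: str) -> GaussianSplatScene:
    """End-to-end pipeline: text -> 360° walkthrough video -> Gaussian splat scene."""
    frames = generate_panoramic_video(prompt, num_frames=128)
    return reconstruct_gaussian_splats(frames)
```

The key design choice this sketch reflects is that the two stages are decoupled: the panoramic video serves as the sole intermediate representation, so any sufficiently consistent 360° video generator and any feedforward splat reconstructor could, in principle, be swapped in at the stage boundary.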