Abstract
Despite large-scale pretraining endowing models with language and vision
reasoning capabilities, improving their spatial reasoning remains
challenging due to the lack of data grounded in the 3D world. While it is
possible for humans to manually create immersive and interactive worlds through
3D graphics, as seen in applications such as VR, gaming, and robotics, this
process remains highly labor-intensive. In this paper, we propose a scalable
method for generating high-quality 3D environments that can serve as training
data for foundation models. We recast 3D environment building as a sequential
decision-making problem, employing vision-language models (VLMs) as policies
that output actions to jointly craft a 3D environment's layout, materials,
lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to
generate more prompt-aligned 3D environments via self-improvement fine-tuning.
We demonstrate the effectiveness of 3D-Generalist and the proposed training
strategy in generating simulation-ready 3D environments. Furthermore, we
demonstrate its quality and scalability in synthetic data generation by
pretraining a vision foundation model on the generated data. After fine-tuning
the pretrained model on downstream tasks, we show that it surpasses models
pretrained on meticulously human-crafted synthetic data and approaches the
results achieved with real data that is orders of magnitude larger in scale.
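
To make the sequential decision-making formulation concrete, below is a minimal, illustrative sketch of the loop the abstract describes: a VLM policy emits one scene edit at a time (layout, materials, lighting, or assets), and trajectories whose finished scenes align well with the prompt are kept as self-improvement fine-tuning data. All names here (Action, vlm_policy, prompt_alignment_score, build_environment) are hypothetical placeholders assumed for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical action record: one edit to the scene's layout, materials,
# lighting, or assets, expressed as text the way a VLM policy might emit it.
@dataclass
class Action:
    kind: str      # "layout" | "material" | "lighting" | "asset"
    command: str   # e.g. 'place("sofa", x=1.2, y=0.0, rot=90)'

@dataclass
class Scene:
    actions: List[Action] = field(default_factory=list)

    def apply(self, action: Action) -> None:
        # A real system would edit a 3D scene graph and re-render;
        # this sketch only records the action sequence.
        self.actions.append(action)

def vlm_policy(prompt: str, scene: Scene, step: int) -> Action:
    """Placeholder for the VLM policy: given the prompt (and, in the real
    system, renderings of the partial scene), emit the next edit action."""
    kinds = ["layout", "asset", "material", "lighting"]
    kind = kinds[step % len(kinds)]
    return Action(kind=kind, command=f"<edit #{step} targeting {kind}>")

def prompt_alignment_score(prompt: str, scene: Scene) -> float:
    """Placeholder reward: how well renderings of the finished scene match
    the prompt (e.g., as judged by a separate VLM)."""
    return min(1.0, len(scene.actions) / 8)

def build_environment(prompt: str, max_steps: int = 8) -> Scene:
    """Sequential decision-making loop: the policy proposes one edit at a time."""
    scene = Scene()
    for step in range(max_steps):
        scene.apply(vlm_policy(prompt, scene, step))
    return scene

def collect_finetuning_data(prompts: List[str],
                            threshold: float = 0.7) -> List[Tuple[str, List[Action]]]:
    """Self-improvement loop, schematically: keep prompt/action-sequence pairs
    whose finished scenes score well, as new fine-tuning data for the policy."""
    kept = []
    for prompt in prompts:
        scene = build_environment(prompt)
        if prompt_alignment_score(prompt, scene) >= threshold:
            kept.append((prompt, scene.actions))
    return kept

if __name__ == "__main__":
    data = collect_finetuning_data(["a cozy reading nook with warm lighting"])
    print(f"kept {len(data)} prompt/trajectory pairs for fine-tuning")
```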