📄 Abstract
Existing research on 3D Large Language Models (LLMs) still struggles to
achieve grounded question answering, primarily because the mechanism of
human-like scene-object grounded reasoning remains under-explored. This paper
bridges the gap with a novel framework. We first introduce a grounded
Chain-of-Thought reasoning method for 3D scenes (SCENECOT), which decouples a
complex reasoning task into simpler, manageable sub-problems and builds the
corresponding visual clues via multimodal expert modules. To enable this
method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning
dataset, consisting of 185K high-quality instances. Extensive experiments
across various complex 3D scene reasoning benchmarks demonstrate that our
framework achieves strong performance with high grounding-QA coherence. To the
best of our knowledge, this is the first successful application of CoT
reasoning to 3D scene understanding, enabling step-by-step human-like
reasoning and showing potential for extension to broader 3D scene
understanding scenarios.
Authors (5)
Xiongkun Linghu
Jiangyong Huang
Ziyu Zhu
Baoxiong Jia
Siyuan Huang
Submitted
October 19, 2025
Key Contributions
Presents a novel framework for grounded Chain-of-Thought (CoT) reasoning in 3D scenes, decoupling complex tasks into simpler sub-problems and using multimodal expert modules to generate visual clues. Introduces SCENECOT-185K, the first large-scale dataset for grounded CoT reasoning in 3D. The framework achieves strong performance and high grounding-QA coherence across complex 3D scene reasoning benchmarks.
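The decompose-then-ground pipeline described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's implementation: the question decomposition, expert modules, scene representation (a flat object-ID-to-label dict), and answer-aggregation rule are all invented stand-ins for the LLM-driven components SCENECOT actually uses.

```python
from dataclasses import dataclass

@dataclass
class Clue:
    step: str      # the sub-problem this clue addresses
    evidence: str  # an object ID grounded in the scene, or "none"

def decompose(question: str) -> list[str]:
    # Stand-in decomposition: a real system would prompt an LLM to split
    # the question into grounded sub-problems.
    return [
        f"identify the target category in: {question}",
        "ground candidate objects in the 3D scene",
        "select the object that satisfies the spatial constraint",
    ]

def expert_module(step: str, scene: dict[str, str]) -> Clue:
    # Mock expert: return the first object whose label appears in the step.
    for obj_id, label in scene.items():
        if label in step:
            return Clue(step, obj_id)
    return Clue(step, "none")

def grounded_cot(question: str, scene: dict[str, str]) -> tuple[list[Clue], str]:
    # Run each sub-problem through an expert module, collecting visual clues,
    # then aggregate (toy rule: the last grounded piece of evidence wins).
    clues = [expert_module(s, scene) for s in decompose(question)]
    answer = next((c.evidence for c in reversed(clues) if c.evidence != "none"),
                  "unknown")
    return clues, answer
```

The point of the structure, as in the paper, is that each reasoning step leaves behind an explicit grounded clue, so the final answer can be traced back to scene evidence rather than emerging from a single opaque generation.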
Business Value
Enables AI systems to understand and reason about complex 3D environments more effectively, crucial for advanced robotics, AR/VR applications, and intelligent spatial assistants.