Abstract
3D visual grounding (3DVG) is challenging because it requires understanding 3D spatial relations. While supervised approaches achieve superior performance, they are constrained by the scarcity and high annotation cost of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or deliver unsatisfactory accuracy. To address these challenges, we introduce Language-to-Space Programming (LaSP), a novel method for training-free 3D visual grounding. LaSP uses LLM-generated code to analyze 3D spatial relations among objects, together with a pipeline that automatically evaluates and optimizes the generated code. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods, while substantially reducing grounding time and token costs and offering a balanced trade-off between performance and efficiency.
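To make the core idea concrete, the sketch below (ours, not from the paper) shows the kind of spatial-relation predicates an LLM might generate under LaSP, assuming each object is reduced to a 3D bounding-box center in a viewer-aligned frame; the function names `is_left_of` and `nearest` and the toy scene are hypothetical.

```python
import numpy as np

# Hypothetical illustration of LLM-generated spatial predicates, assuming
# each object is a 3D bounding-box center in a viewer-aligned frame where
# smaller x means "further left".

def is_left_of(target: np.ndarray, anchor: np.ndarray) -> bool:
    """True if `target` lies to the left of `anchor` along the x-axis."""
    return target[0] < anchor[0]

def nearest(anchor: np.ndarray, candidates: np.ndarray) -> int:
    """Index of the candidate center closest to `anchor` (Euclidean)."""
    return int(np.argmin(np.linalg.norm(candidates - anchor, axis=1)))

# Toy usage: ground "the chair left of the table" over two chair candidates.
table = np.array([1.0, 0.0, 0.0])
chairs = np.array([[0.2, 0.1, 0.0],    # candidate 0, left of the table
                   [2.0, -0.3, 0.0]])  # candidate 1, right of the table
print([i for i, c in enumerate(chairs) if is_left_of(c, table)])  # [0]
print(nearest(table, chairs))                                     # 0
```

Expressing relations as executable predicates rather than free-form text reasoning is what enables the automatic evaluation and optimization pipeline described above.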