Abstract: 3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text
prompts, which is essential for applications such as robotics. However,
existing 3DVG methods encounter two main challenges: first, they struggle to
handle the implicit representation of spatial textures in 3D Gaussian Splatting
(3DGS), making per-scene training indispensable; second, they typically require
large amounts of labeled data for effective training. To this end, we propose
\underline{G}rounding via \underline{V}iew \underline{R}etrieval (GVR), a novel
zero-shot visual grounding framework for 3DGS that reformulates 3DVG as a 2D
retrieval task. GVR leverages object-level view retrieval to collect grounding
clues from multiple views, which not only avoids the costly process of 3D
annotation but also eliminates the need for per-scene training. Extensive
experiments demonstrate that our method achieves state-of-the-art visual
grounding performance while avoiding per-scene training, providing a solid
foundation for zero-shot 3DVG research. Video demos can be found at
https://github.com/leviome/GVR_demos.
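
A minimal sketch of the object-level view-retrieval idea mentioned in the abstract, assuming a CLIP-style image-text encoder is used to score object crops from rendered views against the text prompt. The model choice, the `retrieve_views` helper, and its parameters are illustrative assumptions, not the paper's actual pipeline:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def retrieve_views(prompt: str, crops: list, k: int = 5):
    """Return indices of the k object crops (one per view) that best match the prompt.

    Hypothetical helper: `crops` is a list of PIL images of candidate objects
    rendered from different views of the 3DGS scene.
    """
    text = clip.tokenize([prompt]).to(device)
    images = torch.stack([preprocess(c) for c in crops]).to(device)
    with torch.no_grad():
        t = model.encode_text(text)
        v = model.encode_image(images)
        # Cosine similarity between each crop and the prompt.
        t = t / t.norm(dim=-1, keepdim=True)
        v = v / v.norm(dim=-1, keepdim=True)
        scores = (v @ t.T).squeeze(-1)
    return scores.topk(min(k, len(crops))).indices.tolist()
```

Under this sketch, the retrieved views would supply 2D grounding clues (e.g., object masks or boxes) that are then aggregated back onto the 3D Gaussians; the aggregation step is not shown here, as the abstract does not specify it.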