Abstract
Open-vocabulary 3D object detection has gained significant interest due to
its critical applications in autonomous driving and embodied AI. Existing
detection methods, whether offline or online, typically rely on dense point
cloud reconstruction, which imposes substantial computational overhead and
memory constraints, hindering real-time deployment in downstream tasks. To
address this, we propose a novel reconstruction-free online framework tailored
for memory-efficient, real-time 3D detection. Specifically, given streaming
posed RGB-D video input, we leverage Cubify Anything, a pre-trained visual
foundation model (VFM), for single-view 3D object detection via bounding boxes,
coupled with CLIP to capture the open-vocabulary semantics of detected objects.
To fuse the bounding boxes detected across different views into a unified set,
we employ an association module that establishes multi-view correspondences and
an optimization module that fuses the 3D bounding boxes of the same instance
predicted across views. The association module combines 3D Non-Maximum
Suppression (NMS) with a box correspondence matching module, while the
optimization module applies an IoU-guided, efficient random optimization
technique based on particle filtering to enforce multi-view consistency of the
3D bounding boxes while keeping computational cost low. Extensive experiments
on the ScanNetV2 and CA-1M datasets demonstrate that our method achieves
state-of-the-art performance among online methods. Benefiting from this
reconstruction-free paradigm, our method generalizes well across diverse
scenarios, enabling real-time perception even in environments exceeding
1,000 square meters.
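To make the association step concrete, below is a minimal sketch of greedy 3D NMS of the kind the abstract mentions. The function names, the axis-aligned box parameterization (cx, cy, cz, w, h, l), and the IoU threshold are our own assumptions for illustration; the paper's actual matcher may operate on oriented boxes and use a different suppression criterion.

```python
import numpy as np

def aabb_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, w, h, l).
    (Assumed parameterization; the paper's boxes may be oriented.)"""
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    # Overlap extent along each axis, clamped to zero when disjoint.
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter_vol = inter.prod()
    union = a[3:].prod() + b[3:].prod() - inter_vol
    return inter_vol / (union + 1e-9)

def nms_3d(boxes, scores, iou_thresh=0.25):
    """Greedy 3D NMS: keep the highest-scoring box, suppress its overlaps,
    and repeat. `boxes` is an (N, 6) array, `scores` an (N,) array."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        ious = np.array([aabb_iou_3d(boxes[i], boxes[j]) for j in rest])
        order = rest[ious < iou_thresh]  # drop boxes overlapping the kept one
    return keep
```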
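The IoU-guided random optimization based on particle filtering could be sketched roughly as the loop below, which perturbs, weights, and resamples candidate fused boxes by their mean IoU against the per-view detections of one instance. This reuses `aabb_iou_3d` from the sketch above; the Gaussian perturbation model, annealing schedule, and all hyperparameters are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def fuse_boxes_particle(observations, n_particles=256, n_iters=5,
                        sigma=0.05, rng=None):
    """Particle-filter-style fusion: estimate one 3D box that is maximally
    consistent (by mean IoU) with the (M, 6) per-view boxes of an instance."""
    rng = np.random.default_rng() if rng is None else rng
    obs = np.asarray(observations, dtype=float)
    # Initialize every particle at the mean of the observed boxes.
    particles = np.repeat(obs.mean(axis=0, keepdims=True), n_particles, axis=0)
    for _ in range(n_iters):
        # Perturb particles with Gaussian noise on centers and sizes.
        particles = particles + rng.normal(scale=sigma, size=particles.shape)
        particles[:, 3:] = np.clip(particles[:, 3:], 1e-3, None)  # sizes > 0
        # Weight each particle by its mean IoU over all view observations.
        weights = np.array([
            np.mean([aabb_iou_3d(p, o) for o in obs]) for p in particles
        ]) + 1e-9
        weights /= weights.sum()
        # Resample particles in proportion to their IoU weights.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
        sigma *= 0.7  # anneal the perturbation scale
    # Return the particle with the highest mean IoU as the fused box.
    best = np.argmax([np.mean([aabb_iou_3d(p, o) for o in obs])
                      for p in particles])
    return particles[best]
```

The appeal of such a sampling-based scheme over gradient methods is that 3D IoU is non-differentiable at box boundaries, while random perturbation and resampling only require evaluating IoU, keeping the per-instance cost low.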