Abstract: Open-world 3D scene understanding is a critical challenge that involves
recognizing and distinguishing diverse objects and categories from 3D data,
such as point clouds, without relying on manual annotations. Traditional
methods struggle with this open-world task, particularly because of the
difficulty of constructing extensive point cloud-text pairs and of handling
multimodal data
effectively. In response to these challenges, we present UniPLV, a robust
framework that unifies point clouds, images, and text within a single learning
paradigm for comprehensive 3D scene understanding. UniPLV leverages images as a
bridge to co-embed 3D points with pre-aligned images and text in a shared
feature space, eliminating the need for labor-intensive point cloud-text pair
crafting. Our framework achieves precise multimodal alignment through two
innovative strategies: (i) Logit and feature distillation modules between
images and point clouds to enhance feature coherence; (ii) A vision-point
matching module that implicitly corrects 3D semantic predictions affected by
projection inaccuracies from points to pixels. To further boost performance, we
implement four task-specific losses alongside a two-stage training strategy.
Extensive experiments demonstrate that UniPLV significantly surpasses
state-of-the-art methods, with average improvements of 15.6% and 14.8% in
semantic segmentation for Base-Annotated and Annotation-Free tasks,
respectively. These results underscore UniPLV's efficacy in pushing the
boundaries of open-world 3D scene understanding. We will release the code to
support future research and development.
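To make the point-image alignment idea concrete, the following is a minimal sketch of logit and feature distillation between projected point features and per-pixel image features. All names, signatures, and the temperature parameter are illustrative assumptions, not UniPLV's actual modules; the vision-point matching module, the four task-specific losses, and the two-stage training schedule are not reproduced here.

    # Minimal sketch of point-to-image logit/feature distillation in PyTorch.
    # Function and argument names are hypothetical, not UniPLV's API.
    import torch
    import torch.nn.functional as F


    def distillation_losses(point_feats, point_logits,
                            image_feats, image_logits,
                            pixel_idx, temperature=1.0):
        """Align the point branch with the (pre-aligned) image branch.

        point_feats:  (N, C) features of N LiDAR points
        point_logits: (N, K) class logits from the point branch
        image_feats:  (M, C) per-pixel features from the image branch
        image_logits: (M, K) per-pixel class logits from the image branch
        pixel_idx:    (N,)   index of the pixel each point projects to
        """
        # Gather the image feature/logit paired with each projected point.
        paired_feats = image_feats[pixel_idx]      # (N, C)
        paired_logits = image_logits[pixel_idx]    # (N, K)

        # Feature distillation: pull point features toward image features.
        feat_loss = F.mse_loss(point_feats, paired_feats.detach())

        # Logit distillation: KL divergence between softened class distributions.
        t = temperature
        logit_loss = F.kl_div(
            F.log_softmax(point_logits / t, dim=-1),
            F.softmax(paired_logits.detach() / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)

        return feat_loss, logit_loss

In this sketch the image branch is treated as the teacher (its outputs are detached), reflecting the abstract's use of images as a bridge between 3D points and text; how the paper weights these terms against its other losses is not shown.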