Abstract
3D instance segmentation is crucial for understanding complex 3D
environments, yet fully supervised methods require dense point-level
annotations, resulting in substantial annotation costs and labor overhead. To
mitigate this, box-level annotations have been explored as a weaker but more
scalable form of supervision. However, box annotations inherently introduce
ambiguity in overlapping regions, making accurate point-to-instance assignment
challenging. Recent methods address this ambiguity by training a dedicated
pseudo-labeler in an additional training stage to generate pseudo-masks.
However, such two-stage pipelines often increase overall training time and
complexity and hinder end-to-end optimization. To overcome these challenges, we
propose BEEP3D (Box-supervised End-to-End Pseudo-mask generation for 3D instance
segmentation). BEEP3D adopts a student-teacher framework, where the teacher
model serves as a pseudo-labeler and is updated by the student model via an
Exponential Moving Average. To better guide the teacher model to generate
precise pseudo-masks, we introduce an instance center-based query refinement
that enhances position query localization and leverages features near instance
centers. Additionally, we design two novel losses, a query consistency loss and
a masked feature consistency loss, to align semantic and geometric signals between
predictions and pseudo-masks. Extensive experiments on the ScanNetV2 and S3DIS
datasets demonstrate that BEEP3D achieves competitive or superior performance
compared to state-of-the-art weakly supervised methods while remaining
computationally efficient.
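For illustration, the Exponential Moving Average (EMA) teacher update described in the abstract follows a standard pattern. Below is a minimal PyTorch sketch of that generic update; the function name, momentum value, and buffer handling are illustrative assumptions, not details taken from the paper.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Update teacher weights as an exponential moving average of the
    student's: theta_t <- m * theta_t + (1 - m) * theta_s."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p.detach(), alpha=1.0 - momentum)
    # Buffers (e.g., BatchNorm running statistics) are copied directly.
    for t_b, s_b in zip(teacher.buffers(), student.buffers()):
        t_b.copy_(s_b)
```

With this scheme, only the student receives gradients; the teacher drifts slowly toward the student, which is what makes it a stable pseudo-labeler.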
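The instance center-based query refinement is described only at a high level in the abstract. The sketch below is a hypothetical reading: each position query predicts an instance center, and the query is refined with features of the k nearest points to that center. The k-NN gathering and residual mean fusion are assumptions, not the paper's exact design.

```python
import torch

def center_based_query_refinement(queries: torch.Tensor,  # (Q, C) instance queries
                                  centers: torch.Tensor,  # (Q, 3) predicted centers
                                  coords: torch.Tensor,   # (N, 3) point coordinates
                                  feats: torch.Tensor,    # (N, C) point features
                                  k: int = 16) -> torch.Tensor:
    # Gather the k points nearest each predicted instance center and
    # fuse their mean feature into the corresponding query (residual add).
    dists = torch.cdist(centers, coords)            # (Q, N) pairwise distances
    knn_idx = dists.topk(k, largest=False).indices  # (Q, k) nearest-point indices
    local_feats = feats[knn_idx].mean(dim=1)        # (Q, C) pooled local features
    return queries + local_feats
```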
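The abstract does not give formulas for the two consistency losses. A minimal sketch is shown below, assuming the query consistency loss matches student queries to detached teacher queries and the masked feature consistency loss matches features average-pooled inside the teacher's pseudo-masks; the tensor shapes and the MSE objective are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def query_consistency_loss(student_q: torch.Tensor,
                           teacher_q: torch.Tensor) -> torch.Tensor:
    # Pull student instance queries toward the teacher's; the teacher
    # (pseudo-labeler) receives no gradient through this loss.
    return F.mse_loss(student_q, teacher_q.detach())

def masked_feature_consistency_loss(student_f: torch.Tensor,    # (N, C) features
                                    teacher_f: torch.Tensor,    # (N, C) features
                                    pseudo_masks: torch.Tensor  # (I, N) binary masks
                                    ) -> torch.Tensor:
    # Average-pool point features inside each pseudo-mask, then align
    # the student's pooled features with the teacher's.
    w = pseudo_masks.float()
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1.0)  # per-mask mean weights
    return F.mse_loss(w @ student_f, (w @ teacher_f).detach())
```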
Key Contributions
Proposes BEEP3D, an end-to-end framework for 3D instance segmentation using only box-level supervision. It employs a student-teacher approach to generate pseudo-masks, overcoming the ambiguity of box annotations and avoiding the multi-stage complexity of prior methods.
Business Value
Significantly reduces the effort and cost associated with annotating 3D data for tasks like scene understanding in robotics or autonomous driving. Enables more scalable development of 3D perception systems.