Abstract
Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly
learning scene geometry and semantics, enabling downstream applications such as
navigation in mobile robotics. The recent generalization to Panoptic Scene
Completion (PSC) advances the SSC domain by integrating instance-level
information, thereby enhancing object-level sensitivity in scene understanding.
While PSC was introduced using LiDAR modality, methods based on camera images
remain largely unexplored. Moreover, recent Transformer-based approaches
utilize a fixed set of learned queries to reconstruct objects within the scene
volume. Although these queries are typically updated with image context during
training, they remain static at test time, limiting their ability to adapt
dynamically to the observed scene. To overcome these
limitations, we propose IPFormer, the first method that leverages
context-adaptive instance proposals at train and test time to address
vision-based 3D Panoptic Scene Completion. Specifically, IPFormer adaptively
initializes these queries as panoptic instance proposals derived from image
context and further refines them through attention-based encoding and decoding
to reason about semantic instance-voxel relationships. Extensive experimental
results show that our approach achieves state-of-the-art in-domain performance,
exhibits superior zero-shot generalization on out-of-domain data, and delivers
a runtime reduction exceeding 14x. These results highlight our introduction of
context-adaptive instance proposals as a pioneering effort in addressing
vision-based 3D Panoptic Scene Completion.
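The core idea, initializing instance queries from image context and refining them with attention over the scene volume, can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the paper's released implementation: the module names, tensor shapes, mean-pooled image context, and single cross-attention layer are all assumptions made for clarity.

```python
# Hypothetical sketch of context-adaptive instance proposals (assumptions noted inline).
import torch
import torch.nn as nn

class ContextAdaptiveProposals(nn.Module):
    """Derives instance queries from image features instead of a fixed learned query set,
    then refines them with cross-attention over voxel features (illustrative layout)."""

    def __init__(self, feat_dim: int = 256, num_proposals: int = 100, num_heads: int = 8):
        super().__init__()
        self.num_proposals = num_proposals
        self.feat_dim = feat_dim
        # Assumption: pooled image context is projected into one embedding per proposal.
        self.proposal_init = nn.Linear(feat_dim, num_proposals * feat_dim)
        # Assumption: a single cross-attention layer models instance-voxel relationships.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, image_feats: torch.Tensor, voxel_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N_image_tokens, C) flattened 2D features
        # voxel_feats: (B, N_voxels, C) flattened 3D scene features
        context = image_feats.mean(dim=1)                                   # (B, C) global image context
        proposals = self.proposal_init(context)                             # (B, P * C)
        proposals = proposals.view(-1, self.num_proposals, self.feat_dim)   # (B, P, C)
        refined, _ = self.cross_attn(proposals, voxel_feats, voxel_feats)   # proposals attend to voxels
        return self.norm(proposals + refined)                               # (B, P, C) refined queries

if __name__ == "__main__":
    img = torch.randn(2, 1024, 256)
    vox = torch.randn(2, 4096, 256)
    queries = ContextAdaptiveProposals()(img, vox)
    print(queries.shape)  # torch.Size([2, 100, 256])
```

Because the queries are computed from the input image rather than stored as fixed learned parameters, they adapt to the observed scene at test time as well as during training, which is the distinction the abstract draws against static-query Transformer approaches.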
Authors (7)
Markus Gross
Aya Fahmy
Danit Niwattananan
Dominik Muhle
Rui Song
Daniel Cremers
+1 more
Neural Information Processing Systems (NeurIPS) 2025
Key Contributions
This paper introduces IPFormer, the first method for vision-based 3D Panoptic Scene Completion (PSC) that utilizes context-adaptive instance proposals at both training and test time. This approach overcomes the limitations of static queries in existing Transformer-based methods by allowing dynamic adaptation to the observed scene, leading to improved joint learning of scene geometry and semantics with enhanced object-level sensitivity.
Business Value
Enables more sophisticated scene understanding for robots and AR systems, leading to improved navigation, interaction, and scene reconstruction capabilities in complex environments.