📄 Abstract
Humans learn abstract concepts through multisensory synergy, and once formed,
such representations can often be recalled from a single modality. Inspired by
this principle, we introduce Concerto, a minimalist simulation of human concept
learning for spatial cognition, combining 3D intra-modal self-distillation with
2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more
coherent and informative spatial features, as demonstrated by zero-shot
visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised
models by 14.2% and 4.8%, respectively, as well as their feature concatenation,
in linear probing for 3D scene perception. With full fine-tuning, Concerto sets
new SOTA results across multiple scene understanding benchmarks (e.g., 80.7%
mIoU on ScanNet). We further present a variant of Concerto tailored for
video-lifted point cloud spatial understanding, and a translator that linearly
projects Concerto representations into CLIP's language space, enabling
open-world perception. These results highlight that Concerto yields emergent
spatial representations with superior fine-grained geometric and semantic consistency.
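The two training signals described above can be sketched in a minimal form: a cross-modal loss that pulls paired 3D point features toward their projected 2D pixel features, and a self-distillation teacher updated as an exponential moving average of the student (a common scheme in self-supervised learning, e.g. DINO-style). This is an illustrative sketch under assumed shapes and pairings, not the paper's actual implementation; all function names here are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize feature vectors to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_loss(feat_3d, feat_2d):
    """Mean cosine distance between paired 3D point features and their
    corresponding 2D pixel features (the pairing is assumed given)."""
    a = l2_normalize(feat_3d)
    b = l2_normalize(feat_2d)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def ema_update(teacher, student, momentum=0.996):
    """Self-distillation teacher update: exponential moving average
    of the student's weights."""
    return momentum * teacher + (1.0 - momentum) * student

# Toy example: nearly aligned 3D/2D feature pairs give a small loss.
rng = np.random.default_rng(0)
f3d = rng.normal(size=(5, 8))
f2d = f3d + 0.01 * rng.normal(size=(5, 8))
loss = cross_modal_loss(f3d, f2d)
```

In practice the two objectives are optimized jointly; the sketch only shows the shape of each term, not the encoder architectures or the 2D-3D correspondence computation.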
Authors (7)
Yujia Zhang
Xiaoyang Wu
Yixing Lao
Chengyao Wang
Zhuotao Tian
Naiyan Wang
+1 more
Submitted
October 27, 2025
Neural Information Processing Systems 2025
Key Contributions
Concerto introduces a minimalist yet effective approach to learning spatial representations by combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. This method learns more coherent and informative spatial features, outperforming standalone 2D/3D self-supervised models and achieving new state-of-the-art results on 3D scene understanding benchmarks, demonstrating its ability to generalize and recall concepts from single modalities.
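The abstract also mentions a translator that linearly projects Concerto representations into CLIP's language space for open-world perception. A minimal sketch of such a linear translator, fit by least squares on hypothetical paired features (all shapes, names, and the synthetic data below are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Hypothetical dimensions: Concerto features (d_c) -> CLIP space (d_clip).
rng = np.random.default_rng(1)
d_c, d_clip, n = 16, 8, 200

W_true = rng.normal(size=(d_c, d_clip))       # unknown ground-truth map (synthetic)
concerto_feats = rng.normal(size=(n, d_c))    # stand-in for learned point features
clip_targets = concerto_feats @ W_true        # stand-in for paired CLIP embeddings

# Fit the linear translator by least squares on the paired features.
W, *_ = np.linalg.lstsq(concerto_feats, clip_targets, rcond=None)

def cosine(a, b, eps=1e-8):
    """Cosine similarity between projected features and CLIP text
    embeddings, as used for open-vocabulary queries (sketch only)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return a @ b.T

projected = concerto_feats @ W  # features now live in the CLIP-like space
```

Because the map is linear, fitting it is cheap and leaves the underlying Concerto encoder untouched; open-world labels then come from nearest CLIP text embeddings under cosine similarity.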
Business Value
Enables more robust 3D perception for applications like robotics, AR/VR, and autonomous systems, leading to more intelligent and context-aware agents. Improved spatial understanding can enhance user experiences in virtual environments.