Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
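
The translator mentioned in the abstract is described only as a linear projection of Concerto features into CLIP's language space. The sketch below shows one way such a component could support open-world labeling of points; it is a minimal illustration, not the authors' implementation, and the class names, feature dimensions, and the assumption of pre-computed, normalized CLIP text embeddings are all hypothetical.

```python
# Minimal sketch (assumptions, not the paper's code): a single linear layer
# maps Concerto point features into CLIP's text-embedding space, and each
# point is labeled by its nearest class prompt under cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearTranslator(nn.Module):
    """Hypothetical linear map from Concerto features to CLIP space."""

    def __init__(self, concerto_dim: int = 512, clip_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(concerto_dim, clip_dim)

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (N, concerto_dim) per-point features from a frozen backbone.
        return F.normalize(self.proj(point_feats), dim=-1)


def open_world_labels(point_feats: torch.Tensor,
                      text_embeds: torch.Tensor,
                      translator: LinearTranslator) -> torch.Tensor:
    """Assign each point the class whose CLIP text embedding is most similar.

    text_embeds: (C, clip_dim) CLIP text features for C class prompts,
    assumed pre-computed and L2-normalized.
    """
    projected = translator(point_feats)   # (N, clip_dim), unit norm
    logits = projected @ text_embeds.t()  # cosine similarities
    return logits.argmax(dim=-1)          # (N,) predicted class indices
```

Because only the projection is linear, any open-world capability in this setup would come almost entirely from the quality of the frozen Concerto features, which appears to be the point of that experiment.
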
Authors (7)
Yujia Zhang
Xiaoyang Wu
Yixing Lao
Chengyao Wang
Zhuotao Tian
Naiyan Wang
+1 more
Submitted: October 27, 2025
arXiv Category: cs.CV
Venue: Neural Information Processing Systems 2025

Key Contributions

Concerto introduces a minimalist yet effective approach to learning spatial representations by combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. The resulting spatial features are more coherent and informative than those of standalone 2D or 3D self-supervised models, and full fine-tuning sets new state-of-the-art results on 3D scene understanding benchmarks. Echoing the multisensory inspiration in the abstract, the jointly learned representations can be recalled from a single modality; a sketch of how the two training signals could be combined follows below.
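
As a rough illustration of how these two ingredients could be combined in a single training objective, here is a hedged sketch assuming a teacher-student setup for the 3D branch and a 2D backbone whose pixel features are gathered where each point projects into the images; every function name, temperature, and loss weight below is an assumption rather than the paper's actual recipe.

```python
# Hedged sketch of a two-term objective in the spirit of Concerto:
# (1) 3D intra-modal self-distillation between two augmented views,
# (2) 2D-3D cross-modal joint embedding that aligns point and pixel features.
# All details (temperature, weighting, distance choices) are assumptions.
import torch
import torch.nn.functional as F


def intra_modal_self_distillation(student_feats: torch.Tensor,
                                  teacher_feats: torch.Tensor,
                                  temperature: float = 0.1) -> torch.Tensor:
    # student_feats, teacher_feats: (N, D) per-point features from two views;
    # the teacher is typically an EMA copy of the student and is not updated here.
    log_p_student = F.log_softmax(student_feats / temperature, dim=-1)
    p_teacher = F.softmax(teacher_feats.detach() / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


def cross_modal_joint_embedding(point_feats: torch.Tensor,
                                pixel_feats: torch.Tensor) -> torch.Tensor:
    # point_feats: (N, D) features from the 3D backbone;
    # pixel_feats: (N, D) 2D features sampled at the pixels each point
    # projects to (camera poses assumed known).
    p = F.normalize(point_feats, dim=-1)
    q = F.normalize(pixel_feats, dim=-1)
    return (1.0 - (p * q).sum(dim=-1)).mean()  # mean cosine distance


def concerto_style_loss(student_feats, teacher_feats,
                        point_feats, pixel_feats,
                        cross_modal_weight: float = 1.0) -> torch.Tensor:
    # Illustrative weighting; the paper's actual schedule may differ.
    return (intra_modal_self_distillation(student_feats, teacher_feats)
            + cross_modal_weight * cross_modal_joint_embedding(point_feats, pixel_feats))
```

Read this way, the first term keeps the 3D features stable across augmentations, while the second ties each point's feature to the image feature it projects onto, matching the intuition of recalling a jointly learned concept from a single modality.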

Business Value

Enables more robust 3D perception for applications like robotics, AR/VR, and autonomous systems, leading to more intelligent and context-aware agents. Improved spatial understanding can enhance user experiences in virtual environments.