Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
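
The translator mentioned in the abstract is described only as a linear projection of Concerto features into CLIP's language space. The sketch below shows one way such a component could support open-world labeling of points; it is a minimal illustration, not the authors' implementation, and the class names, feature dimensions, and the assumption of pre-computed, normalized CLIP text embeddings are all hypothetical.

```python
# Minimal sketch (assumptions, not the paper's code): a single linear layer
# maps Concerto point features into CLIP's text-embedding space, and each
# point is labeled by its nearest class prompt under cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearTranslator(nn.Module):
    """Hypothetical linear map from Concerto features to CLIP space."""

    def __init__(self, concerto_dim: int = 512, clip_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(concerto_dim, clip_dim)

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (N, concerto_dim) per-point features from a frozen backbone.
        return F.normalize(self.proj(point_feats), dim=-1)


def open_world_labels(point_feats: torch.Tensor,
                      text_embeds: torch.Tensor,
                      translator: LinearTranslator) -> torch.Tensor:
    """Assign each point the class whose CLIP text embedding is most similar.

    text_embeds: (C, clip_dim) CLIP text features for C class prompts,
    assumed pre-computed and L2-normalized.
    """
    projected = translator(point_feats)   # (N, clip_dim), unit norm
    logits = projected @ text_embeds.t()  # cosine similarities
    return logits.argmax(dim=-1)          # (N,) predicted class indices
```

Because only the projection is linear, any open-world capability in this setup would come almost entirely from the quality of the frozen Concerto features, which appears to be the point of that experiment.
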
Authors (7)
Yujia Zhang
Xiaoyang Wu
Yixing Lao
Chengyao Wang
Zhuotao Tian
Naiyan Wang
+1 more
Submitted: October 27, 2025
arXiv Category: cs.CV
Venue: Neural Information Processing Systems 2025

Key Contributions

Concerto introduces a minimalist yet effective approach to learning spatial representations by combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. The resulting spatial features are more coherent and informative than those of standalone 2D or 3D self-supervised models, and full fine-tuning sets new state-of-the-art results on 3D scene understanding benchmarks. Echoing the multisensory inspiration in the abstract, the jointly learned representations can be recalled from a single modality; a sketch of how the two training signals could be combined follows below.
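
As a rough illustration of how these two ingredients could be combined in a single training objective, here is a hedged sketch assuming a teacher-student setup for the 3D branch and a 2D backbone whose pixel features are gathered where each point projects into the images; every function name, temperature, and loss weight below is an assumption rather than the paper's actual recipe.

```python
# Hedged sketch of a two-term objective in the spirit of Concerto:
# (1) 3D intra-modal self-distillation between two augmented views,
# (2) 2D-3D cross-modal joint embedding that aligns point and pixel features.
# All details (temperature, weighting, distance choices) are assumptions.
import torch
import torch.nn.functional as F


def intra_modal_self_distillation(student_feats: torch.Tensor,
                                  teacher_feats: torch.Tensor,
                                  temperature: float = 0.1) -> torch.Tensor:
    # student_feats, teacher_feats: (N, D) per-point features from two views;
    # the teacher is typically an EMA copy of the student and is not updated here.
    log_p_student = F.log_softmax(student_feats / temperature, dim=-1)
    p_teacher = F.softmax(teacher_feats.detach() / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


def cross_modal_joint_embedding(point_feats: torch.Tensor,
                                pixel_feats: torch.Tensor) -> torch.Tensor:
    # point_feats: (N, D) features from the 3D backbone;
    # pixel_feats: (N, D) 2D features sampled at the pixels each point
    # projects to (camera poses assumed known).
    p = F.normalize(point_feats, dim=-1)
    q = F.normalize(pixel_feats, dim=-1)
    return (1.0 - (p * q).sum(dim=-1)).mean()  # mean cosine distance


def concerto_style_loss(student_feats, teacher_feats,
                        point_feats, pixel_feats,
                        cross_modal_weight: float = 1.0) -> torch.Tensor:
    # Illustrative weighting; the paper's actual schedule may differ.
    return (intra_modal_self_distillation(student_feats, teacher_feats)
            + cross_modal_weight * cross_modal_joint_embedding(point_feats, pixel_feats))
```

Read this way, the first term keeps the 3D features stable across augmentations, while the second ties each point's feature to the image feature it projects onto, matching the intuition of recalling a jointly learned concept from a single modality.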

Business Value

Enables more robust 3D perception for applications like robotics, AR/VR, and autonomous systems, leading to more intelligent and context-aware agents. Improved spatial understanding can enhance user experiences in virtual environments.