📄 Abstract
Recognizing the sounding objects in scenes is a longstanding objective in
embodied AI, with diverse applications in robotics and AR/VR/MR. To that end,
Audio-Visual Segmentation (AVS), which takes an audio signal as a condition to
identify the masks of the target sounding objects in an input image captured
with synchronous camera and microphone sensors, has recently been advanced. However,
this paradigm is still insufficient for real-world operation, as the mapping
from 2D images to 3D scenes is missing. To address this fundamental limitation,
we introduce a novel research problem, 3D Audio-Visual Segmentation, extending
the existing AVS to the 3D output space. This problem poses more challenges due
to variations in camera extrinsics, audio scattering, occlusions, and diverse
acoustics across sounding object categories. To facilitate this research, we
create the first simulation-based benchmark, 3DAVS-S34-O7, providing
photorealistic 3D scene environments with grounded spatial audio under
single-instance and multi-instance settings, across 34 scenes and 7 object
categories. This is made possible by re-purposing the Habitat simulator to
generate comprehensive annotations of sounding object locations and
corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet,
characterized by synergistically integrating ready-to-use knowledge from
pretrained 2D audio-visual foundation models with a 3D visual scene
representation through spatial audio-aware mask alignment and refinement.
Extensive experiments demonstrate that EchoSegnet can effectively segment
sounding objects in 3D space on our new benchmark, representing a significant
advancement in the field of embodied AI. Project page:
https://x-up-lab.github.io/research/3d-audio-visual-segmentation/
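The "spatial audio-aware mask alignment" described in the abstract implies lifting 2D masks predicted by a pretrained AVS model into the 3D scene. Below is a minimal sketch of that lifting step, assuming per-frame metric depth, pinhole intrinsics K, and camera-to-world extrinsics are available; the function name and interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lift_mask_to_3d(mask, depth, K, cam_to_world):
    """Unproject the pixels of a 2D segmentation mask into world-space
    3D points using a depth map and camera parameters.

    mask:         (H, W) boolean array from a 2D AVS model
    depth:        (H, W) metric depth in metres
    K:            (3, 3) pinhole camera intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsics
    """
    v, u = np.nonzero(mask)                  # pixel rows/cols inside the mask
    z = depth[v, u]
    valid = z > 0                            # drop pixels with no depth reading
    u, v, z = u[valid], v[valid], z[valid]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project to camera coordinates (pinhole model).
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # (N, 4) homogeneous
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world                          # (N, 3) world-space points
```

Accumulating these per-frame point sets across views is one plausible way to form the 3D masks that the alignment step then scores against the audio.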
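The "spatial audio-aware" part of the alignment presumably needs a direction cue from the grounded spatial audio. A standard way to obtain one from a two-channel recording is GCC-PHAT; the sketch below is a generic implementation of that well-known technique, offered as an assumption about how such a cue could be computed rather than EchoSegnet's actual module.

```python
import numpy as np

def gcc_phat_tdoa(left, right, fs, max_tau=None):
    """Estimate the time difference of arrival (TDOA) between two
    microphone channels via GCC-PHAT cross-correlation."""
    n = left.shape[0] + right.shape[0]
    L = np.fft.rfft(left, n=n)
    R = np.fft.rfft(right, n=n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting (unit magnitude)
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:                  # cap search at physically possible lags
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-centre so lag 0 sits in the middle of the correlation window.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                        # TDOA in seconds

def tdoa_to_azimuth(tdoa, mic_distance, c=343.0):
    """Convert a TDOA to an azimuth angle for a two-mic array."""
    sin_theta = np.clip(tdoa * c / mic_distance, -1.0, 1.0)
    return np.arcsin(sin_theta)              # radians; 0 = broadside
```

A direction estimate of this kind could then be compared against the bearing of each candidate 3D mask to decide which object is actually sounding.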
Authors (3)
Artem Sokolov
Swapnil Bhosale
Xiatian Zhu
Submitted
November 4, 2024
Key Contributions
This paper introduces the novel research problem of 3D Audio-Visual Segmentation (3D-AVS), extending 2D AVS to the 3D output space by mapping audio signals to object masks within 3D scenes. It addresses fundamental limitations of 2D AVS by incorporating 3D geometry and spatial audio, and introduces the first simulation-based benchmark (3DAVS-S34-O7) to facilitate research in this area.
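The excerpt does not state the benchmark's evaluation metric. A natural choice for point-level 3D masks is intersection-over-union, sketched below under the assumption that predictions and ground truth are boolean masks over a shared set of scene points; the paper's actual protocol may differ.

```python
import numpy as np

def point_mask_iou(pred, gt):
    """IoU between predicted and ground-truth point-level 3D masks,
    given as boolean arrays over the same set of scene points."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```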
Business Value
Enables robots and AR/VR systems to better understand and interact with their environment by localizing and identifying sound-producing objects in 3D space, leading to more intuitive human-robot interaction and immersive experiences.