OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion

Abstract

Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.

Authors (6)

Tianyu Huang

Runnan Chen

Dongting Hu

Fengming Huang

Mingming Gong

Tongliang Liu

Submitted

October 21, 2025

arXiv Category

cs.CV

Key Contributions

Introduces OpenInsGaussian, an open-vocabulary instance Gaussian segmentation framework with context-aware cross-view fusion. It addresses limitations in contextual cues and multi-view fusion by employing attention mechanisms for selective feature aggregation.
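
The idea of selective multi-view aggregation can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the mean-feature query, and the `temperature` parameter are all assumptions made for illustration. It shows how softmax attention weights can down-weight a view whose feature disagrees with the others before the per-view features of one instance are fused.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features, temperature=0.1):
    """Fuse the per-view features of one instance with attention weights.

    Assumption: the attention query is the normalized mean of all views,
    so views that agree with the consensus receive larger weights.
    """
    feats = [l2_normalize(f) for f in view_features]
    dim = len(feats[0])
    mean = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
    query = l2_normalize(mean)
    # Cosine similarity to the query, sharpened by the temperature.
    weights = softmax([dot(query, f) / temperature for f in feats])
    fused = [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]
    return l2_normalize(fused), weights


# Hypothetical 2-D features: two consistent views and one misaligned view.
fused, weights = fuse_views([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

In this toy input the third view points in a different direction from the other two, so it receives the smallest weight and contributes least to the fused feature.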

Business Value

Enables more comprehensive and accurate 3D scene understanding for applications like autonomous navigation and robotic manipulation, leading to improved safety and task performance. Facilitates richer AR experiences.

Paper Metadata

Innovation Type

Algorithmic Framework

Deployment Feasibility

Moderate; Gaussian Splatting and feature fusion require significant computational resources.

Limitations Addressed

Insufficient contextual cues for individual masks and inconsistencies/missing details in multi-view feature fusion from 2D vision models used in semantic Gaussian Splatting.
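
One common way to give a mask more context, sketched below, is to encode an enlarged crop around it and blend that feature with the mask-only feature. Whether OpenInsGaussian does this is not stated in the abstract. The helpers `expand_box` and `blend_features`, the `pad_ratio` and `alpha` parameters, and the blending rule are hypothetical choices for illustration only.

```python
def expand_box(box, image_size, pad_ratio=0.25):
    """Grow a mask's bounding box by pad_ratio on each side, clamped to the image.

    box is (x0, y0, x1, y1) in pixels; image_size is (width, height).
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    pad_x = (x1 - x0) * pad_ratio
    pad_y = (y1 - y0) * pad_ratio
    return (
        max(0, int(x0 - pad_x)),
        max(0, int(y0 - pad_y)),
        min(width, int(x1 + pad_x)),
        min(height, int(y1 + pad_y)),
    )


def blend_features(mask_feature, context_feature, alpha=0.3):
    """Linearly mix a mask-only feature with a feature from its context crop.

    alpha is the (assumed) share of the context feature in the result.
    """
    return [(1 - alpha) * m + alpha * c
            for m, c in zip(mask_feature, context_feature)]


# Hypothetical mask box inside a 100x100 image.
context_box = expand_box((40, 40, 80, 80), (100, 100))
```

The enlarged box would be cropped and passed through the same 2D encoder as the mask crop. That encoder is outside the scope of this sketch.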

Technical Tags

instance segmentation, Gaussian Splatting, open-vocabulary, 3D scenes, cross-view fusion, context-aware features, attention mechanisms, autonomous driving, robotics, augmented reality

Research Topics

3D Scene Understanding, Computer Vision, Generative Models, Robotics Perception

Methods & Architectures

context-aware feature extraction, attention-driven feature aggregation, cross-view fusion, OpenInsGaussian

Applications & Tasks

Autonomous Driving, Robotics, Augmented Reality, 3D Reconstruction, Instance Segmentation, 3D Scene Representation, Multi-view Fusion, Open-vocabulary instance segmentation in 3D, Improving 3D scene understanding from multiple views

Datasets & Benchmarks

Datasets

Unnamed benchmark datasets (the abstract does not say which)

Benchmarks

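State-of-the-art results reported for open-vocabulary 3D Gaussian segmentation (specific benchmarks are not named in the abstract)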

Related Fields

Computer Vision, Robotics, 3D Graphics, Machine Learning

Keywords

instance segmentation, Gaussian Splatting, 3D scene understanding, open-vocabulary, cross-view fusion, context-aware, attention, autonomous driving, robotics, augmented reality, OpenInsGaussian

Academic Context

#3D Scene Understanding #Computer Vision #Generative Models #Robotics Perception

Commercial Potential

Potential Products

3D scene reconstruction software, Robotic perception systems, AR/VR content creation tools

Target Industries

Automotive, Robotics, Gaming, Architecture, Construction

Use Case Examples

Enabling robots to accurately identify and segment objects in their environment

Creating detailed 3D models of real-world scenes for simulation or AR overlay

Competitive Edge

Advances semantic Gaussian Splatting by incorporating open-vocabulary instance segmentation and improved cross-view fusion techniques.

Market Opportunity

Growing markets for 3D sensing, robotics, and AR/VR.

Revenue Models

Licensing of core technology; development of specialized applications.

Resource Requirements

Compute Needs

High, particularly for rendering and feature aggregation.

Data Requirements

Multi-view 3D scene data with instance-level annotations.

Deployment Constraints

Computational cost and real-time performance requirements.

Scalability

Scalability depends on the efficiency of the Gaussian Splatting rendering and the attention mechanism.

Production Readiness

Maturity Level

Research

Time to Market

3-5 years for robust applications

Licensing

TBD.

Patent Potential

Moderate for the context-aware fusion and attention mechanisms.
