
OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion

📄 Abstract

Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
Authors (6)
Tianyu Huang
Runnan Chen
Dongting Hu
Fengming Huang
Mingming Gong
Tongliang Liu
Submitted
October 21, 2025
arXiv Category
cs.CV

Key Contributions

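Introduces OpenInsGaussian, an open-vocabulary instance Gaussian segmentation framework with context-aware cross-view fusion. Its Context-Aware Feature Extraction module augments each mask with semantic context, addressing the insufficient contextual cues available to individual masks during preprocessing. Its Attention-Driven Feature Aggregation module selectively fuses multi-view features to mitigate alignment errors and incompleteness.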
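
To make the idea of selective multi-view fusion concrete, here is a minimal sketch of attention-weighted feature aggregation for a single 3D instance. It is not the paper's implementation: the function name `attention_fuse`, the use of the normalised mean feature as the attention query, and the `temperature` parameter are all assumptions chosen for illustration. The abstract only states that multi-view features are fused selectively.

```python
import numpy as np


def attention_fuse(view_feats, temperature=0.1):
    """Fuse per-view features of shape (V, D) into one (D,) vector.

    Hypothetical illustration: views that agree with the consensus
    direction receive larger weights; outlier views are down-weighted.
    """
    feats = np.asarray(view_feats, dtype=np.float64)

    # L2-normalise each view feature so dot products are cosine similarities.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.clip(norms, 1e-12, None)

    # Assumed query: the normalised mean of all view features.
    query = feats.mean(axis=0)
    query = query / max(np.linalg.norm(query), 1e-12)

    # Softmax over temperature-scaled similarities (shifted for stability).
    scores = feats @ query / temperature
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()

    # Weighted sum of view features, re-normalised to unit length.
    fused = weights @ feats
    fused = fused / max(np.linalg.norm(fused), 1e-12)
    return fused, weights


# Two views roughly agree; the third points in a different direction.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
fused, weights = attention_fuse(views)
```

In this toy example the third view receives the smallest weight, so the fused feature stays close to the two consistent views. A plain average would instead be pulled toward the outlier, which is the kind of cross-view inconsistency that selective fusion is meant to reduce.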

Business Value

Enables more comprehensive and accurate 3D scene understanding for applications such as autonomous navigation and robotic manipulation, which can improve safety and task performance. It also supports richer AR experiences.