EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Abstract

Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
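The incremental update described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `OnlineFeatureMap`, the function `process_stream`, the stand-in `fake_encoder`, and the running-mean fusion rule are all hypothetical, chosen only to show how per-frame observations might be folded into a persistent per-primitive feature store without revisiting earlier frames.

```python
from dataclasses import dataclass, field


@dataclass
class OnlineFeatureMap:
    """Toy stand-in for a Gaussian feature map: one feature vector per primitive id."""
    dim: int
    features: dict = field(default_factory=dict)  # primitive id -> list of floats
    counts: dict = field(default_factory=dict)    # primitive id -> number of observations

    def update(self, observations):
        """Fuse one frame of (primitive_id, feature) pairs using a running mean."""
        for pid, feat in observations:
            if len(feat) != self.dim:
                raise ValueError("feature dimension mismatch")
            n = self.counts.get(pid, 0)
            old = self.features.get(pid, [0.0] * self.dim)
            # Running mean: new = (old * n + feat) / (n + 1)
            self.features[pid] = [(o * n + f) / (n + 1) for o, f in zip(old, feat)]
            self.counts[pid] = n + 1


def process_stream(frames, encode, feature_map):
    """Consume frames one at a time; each frame is encoded once and then discarded."""
    for frame in frames:
        feature_map.update(encode(frame))
    return feature_map


def fake_encoder(frame):
    """Hypothetical encoder: here a frame already is a list of (id, feature) pairs."""
    return frame


# Two frames: primitive 0 is observed twice, primitive 1 once.
stream = [
    [(0, [1.0, 0.0])],
    [(0, [0.0, 1.0]), (1, [2.0, 2.0])],
]
fmap = process_stream(stream, fake_encoder, OnlineFeatureMap(dim=2))
print(fmap.features)
```

In a real system the encoder would be a vision-language or 2D foundation model and the primitives would be 3D Gaussians associated across frames via estimated camera poses; the sketch only captures the "update in place as frames arrive" pattern.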

Authors (6)

Xiaoyu Zhou

Jingqi Wang

Yuang Jia

Yongtao Wang

Deqing Sun

Ming-Hsuan Yang

Submitted

October 29, 2025

arXiv Category

cs.CV

Key Contributions

EA3D presents a unified online framework for open-world 3D object extraction from streaming videos, enabling simultaneous geometric reconstruction and scene understanding. It dynamically interprets frames using foundation models, integrates knowledge into a Gaussian feature map, and iteratively refines geometry and semantics, overcoming limitations of offline methods.

Business Value

Enables real-time 3D mapping and understanding of dynamic environments, crucial for autonomous systems, robotics, and immersive experiences.

Paper Metadata

Innovation Type

Unified Online Framework / Novel Integration

Deployment Feasibility

Moderate to High. Requires significant computational resources for real-time processing but offers a powerful solution for dynamic 3D scene analysis.

Limitations Addressed

Current 3D scene understanding methods rely on offline data or pre-built geometry. EA3D addresses the need for dynamic, online understanding from continuous video streams.

Technical Tags

3D Object Extraction, Streaming Video, Open-World Understanding, Online Reconstruction, Gaussian Splatting, Vision-Language Models, Foundation Models, Scene Understanding, Geometric Reconstruction, Real-time Processing

Research Topics

3D Scene Understanding, Video Analysis, Generative AI, Computer Vision, Robotics Perception

Methods & Architectures

ExtractAnything3D (EA3D), Vision-Language Foundation Encoders, Gaussian Feature Map, Online Update Strategy, Iterative Visual Odometry Estimation, Recurrent Joint Optimization, Gaussian Splatting, Vision-Language Foundation Models

Applications & Tasks

Autonomous Driving, Robotics, Augmented Reality, Virtual Reality, 3D Content Creation

Limitations of offline multi-view data, Pre-constructed 3D geometry dependency, Dynamic scene understanding from streaming video

Online 3D object extraction, Holistic scene understanding, Geometric reconstruction from video

Related Fields

Robotics, Computer Vision, 3D Graphics, Machine Learning, Autonomous Systems

Keywords

3D Reconstruction, Object Extraction, Streaming Video, Online Learning, Scene Understanding, Gaussian Splatting, Foundation Models, Vision-Language, Autonomous Driving, Robotics, Real-time, AR/VR

Academic Context

#3D Scene Understanding, #Video Analysis, #Generative AI, #Computer Vision, #Robotics Perception

Technology Stack

Frameworks & Libraries

Gaussian Splatting

Commercial Potential

Potential Products

Real-time 3D mapping systems, Dynamic environment simulators, Robotic perception modules

Target Industries

Automotive, Robotics, Gaming, Architecture, Construction

Use Case Examples

Creating 3D models of environments from drone footage
Enabling robots to understand and interact with dynamic spaces
Generating real-time 3D scenes for AR applications

Competitive Edge

Offers a novel online, open-world approach to 3D object extraction and scene understanding from video, surpassing methods limited by offline data or static 3D models.

Resource Requirements

Compute Needs

High compute requirements for real-time processing, likely requiring powerful GPUs.

Data Requirements

Requires streaming video data as input.

Deployment Constraints

Real-time performance is critical and may be challenging on resource-constrained hardware. Handling diverse and complex real-world scenes can be difficult.

Scalability

The online update strategy suggests potential for scalability to longer video sequences. Performance may depend on the efficiency of the Gaussian feature map updates.
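A back-of-the-envelope comparison shows why an online update could scale better than reprocessing. The sketch below assumes, purely for illustration, that each per-frame update has roughly constant cost and that the alternative is re-running over all frames seen so far whenever a new frame arrives; neither assumption is taken from the paper, and the function names are hypothetical.

```python
def frames_touched_online(num_frames):
    """Online update: each frame is processed exactly once."""
    return num_frames


def frames_touched_reprocessing(num_frames):
    """Naive alternative: at frame t, reprocess all t frames seen so far.

    Total work is 1 + 2 + ... + T = T * (T + 1) / 2 frame visits.
    """
    return num_frames * (num_frames + 1) // 2


for t in (10, 100, 1000):
    print(t, frames_touched_online(t), frames_touched_reprocessing(t))
```

Under these assumptions the online scheme grows linearly with sequence length while naive reprocessing grows quadratically. The constant-cost assumption is exactly the caveat stated above: if the feature map itself grows with the scene, per-frame update cost may rise as well.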
