EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Abstract

Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
Authors (6)
Xiaoyu Zhou
Jingqi Wang
Yuang Jia
Yongtao Wang
Deqing Sun
Ming-Hsuan Yang
Submitted: October 29, 2025
arXiv Category: cs.CV

Key Contributions

EA3D is a unified online framework for open-world 3D object extraction from streaming video that performs geometric reconstruction and scene understanding simultaneously. It interprets each frame with vision-language and 2D vision foundation encoders, embeds the resulting object-level knowledge into a Gaussian feature map, and incrementally updates geometry and semantics as new frames arrive, without requiring offline-collected multi-view data or pre-constructed 3D geometry.
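
To make the idea of incrementally updating per-Gaussian features from a stream more concrete, here is a minimal sketch in Python. It is not the authors' implementation: it swaps in a plain running average for EA3D's feed-forward update, and it omits visual odometry and the joint optimization module entirely. The names `GaussianFeature`, `OnlineFeatureMap`, and `process_stream`, as well as the assumption that each frame arrives as a mapping from Gaussian id to an observed feature vector, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GaussianFeature:
    """Hypothetical per-Gaussian record: a feature vector and an observation count."""
    feature: list[float]
    count: int = 0


@dataclass
class OnlineFeatureMap:
    """Toy stand-in for a Gaussian feature map, keyed by a hypothetical Gaussian id."""
    dim: int
    gaussians: dict[int, GaussianFeature] = field(default_factory=dict)

    def update(self, gaussian_id: int, observed: list[float]) -> None:
        """Fold one new observation into the stored feature as a running mean."""
        if len(observed) != self.dim:
            raise ValueError(f"expected {self.dim} values, got {len(observed)}")
        # Create the entry on first sight, starting from a zero vector.
        g = self.gaussians.setdefault(
            gaussian_id, GaussianFeature([0.0] * self.dim)
        )
        g.count += 1
        # Incremental mean: new = old + (obs - old) / n. No frame history is kept.
        g.feature = [f + (o - f) / g.count for f, o in zip(g.feature, observed)]


def process_stream(frames, feature_map: OnlineFeatureMap) -> OnlineFeatureMap:
    """Consume frames one at a time; each frame maps Gaussian id -> feature vector."""
    for frame in frames:
        for gaussian_id, observed in frame.items():
            feature_map.update(gaussian_id, observed)
    return feature_map


if __name__ == "__main__":
    # Two toy frames: Gaussian 0 is seen twice, Gaussian 1 only once.
    stream = [{0: [1.0, 0.0]}, {0: [0.0, 1.0], 1: [2.0, 2.0]}]
    fmap = process_stream(stream, OnlineFeatureMap(dim=2))
    for gid, g in sorted(fmap.gaussians.items()):
        print(gid, g.count, g.feature)
```

The point of the sketch is only that each frame is processed once and then discarded, so memory grows with the number of Gaussians and not with the length of the video. A learned update such as the one the abstract describes would replace the running mean in `update`.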

Business Value

Supports online 3D mapping and scene understanding from streaming video, which is relevant to autonomous systems, robotics, and immersive experiences.