📄 Abstract
Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive, and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose ZING-3D, a framework that leverages the vast knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner, while also supporting incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach uses VLM reasoning to generate a rich 2D scene graph, which is grounded in 3D using depth information. Nodes represent open-vocabulary objects with features, 3D locations, and semantic context, while edges capture spatial and semantic relations annotated with inter-object distances. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need for task-specific training.
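To make the grounding step concrete, here is a minimal sketch of how a 2D detection could be lifted into 3D with depth and how an edge could be annotated with an inter-object distance. It assumes a pinhole camera model, and all names and values (`backproject`, the node/edge dictionaries, the intrinsics) are illustrative, not taken from the paper.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into camera-frame 3D coordinates.
    Assumes a pinhole model; the paper's exact projection is not specified here."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical nodes: open-vocabulary label, feature embedding, 3D location.
mug = {"label": "coffee mug", "feature": np.zeros(512),
       "position": backproject(320, 240, 1.4, fx=525.0, fy=525.0, cx=319.5, cy=239.5)}
table = {"label": "table", "feature": np.zeros(512),
         "position": backproject(300, 300, 1.6, fx=525.0, fy=525.0, cx=319.5, cy=239.5)}

# Edge carrying a semantic relation plus the measured inter-object distance.
edge = {"subject": mug["label"], "object": table["label"], "relation": "on top of",
        "distance_m": float(np.linalg.norm(mug["position"] - table["position"]))}
print(edge)
```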
Authors
Pranav Saxena
Jimmy Chiun
Submitted
October 24, 2025
Key Contributions
ZING-3D introduces a framework that leverages foundation models for zero-shot, incremental 3D scene graph generation with geometric grounding. It addresses the limitations of existing methods by enabling open-vocabulary recognition, incorporating new observations incrementally, and integrating 3D spatial information, all of which are crucial for embodied robotics applications.
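As a rough illustration of the incremental-update idea, the sketch below associates a newly observed object with an existing node by label and 3D proximity; the matching rule and threshold are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

MERGE_RADIUS = 0.25  # meters; a hypothetical association threshold

def update_graph(nodes, new_node):
    """Fold a new observation into the graph: merge with a nearby node of the
    same label, otherwise add it as a new object. A simplified stand-in for
    whatever association rule ZING-3D actually uses."""
    for node in nodes:
        same_label = node["label"] == new_node["label"]
        close = np.linalg.norm(node["position"] - new_node["position"]) < MERGE_RADIUS
        if same_label and close:
            # Refine the stored 3D location with the fresh observation.
            node["position"] = 0.5 * (node["position"] + new_node["position"])
            return node
    nodes.append(new_node)  # previously unseen object: grow the graph
    return new_node
```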
Business Value
Enables robots and AI systems to better understand and interact with complex 3D environments in real time, improving navigation, manipulation, and task completion in fields like autonomous driving, logistics, and smart manufacturing.