📄 Abstract
Existing vision-language-action (VLA) models act in the 3D real world but are
typically built on 2D encoders, leaving a spatial reasoning gap that limits
generalization and adaptability. Recent 3D integration techniques for VLAs
either require specialized sensors and transfer poorly across modalities, or
inject weak cues that lack geometry and degrade vision-language alignment. In
this work, we introduce FALCON (From Spatial to Action), a novel paradigm that
injects rich 3D spatial tokens into the action head. FALCON leverages spatial
foundation models to deliver strong geometric priors from RGB alone, and
includes an Embodied Spatial Model that can optionally fuse depth or pose for
higher fidelity when available, without retraining or architectural changes. To
preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced
Action Head rather than being concatenated into the vision-language backbone.
These designs enable FALCON to address limitations in spatial representation,
modality transferability, and alignment. In comprehensive evaluations across
three simulation benchmarks and eleven real-world tasks, FALCON achieves
state-of-the-art performance, consistently surpasses competitive
baselines, and remains robust under clutter, spatial-prompt conditioning, and
variations in object scale and height.
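As a rough illustration of this design, the sketch below is a minimal PyTorch mock-up, not the authors' implementation: it produces spatial tokens from RGB (with optional depth or pose fusion) and consumes them via cross-attention inside the action head, leaving the vision-language backbone untouched. All module names, dimensions, and the use of cross-attention here are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EmbodiedSpatialModel(nn.Module):
    """Hypothetical sketch: stands in for a spatial foundation encoder that
    yields 3D-aware spatial tokens from RGB alone; depth/pose embeddings are
    fused additively only when available, with no architectural change."""
    def __init__(self, dim=256, num_tokens=32):
        super().__init__()
        # Toy stand-in for a pretrained geometry encoder: conv stem + learned queries.
        self.rgb_stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.GELU())
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.depth_proj = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        self.pose_proj = nn.Linear(7, dim)  # assumed xyz + quaternion pose

    def forward(self, rgb, depth=None, pose=None):
        b = rgb.shape[0]
        feats = self.rgb_stem(rgb).flatten(2).transpose(1, 2)   # (B, HW, D)
        if depth is not None:                                    # optional fusion
            feats = feats + self.depth_proj(depth).flatten(2).transpose(1, 2)
        if pose is not None:
            feats = feats + self.pose_proj(pose).unsqueeze(1)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        spatial_tokens, _ = self.attn(q, feats, feats)           # (B, T, D)
        return spatial_tokens

class SpatialEnhancedActionHead(nn.Module):
    """Hypothetical sketch: spatial tokens enter only here, via cross-attention,
    so the vision-language backbone's alignment is never modified."""
    def __init__(self, vl_dim=256, dim=256, action_dim=7):
        super().__init__()
        self.vl_proj = nn.Linear(vl_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, action_dim))

    def forward(self, vl_tokens, spatial_tokens):
        x = self.vl_proj(vl_tokens)                              # VLM features
        x, _ = self.cross_attn(x, spatial_tokens, spatial_tokens)
        return self.mlp(x.mean(dim=1))                           # pooled -> action

# Usage with dummy tensors (RGB only; depth/pose could be passed in later).
esm = EmbodiedSpatialModel()
head = SpatialEnhancedActionHead()
rgb = torch.randn(2, 3, 224, 224)
vl_tokens = torch.randn(2, 64, 256)  # stands in for frozen VLM backbone output
action = head(vl_tokens, esm(rgb))
print(action.shape)  # torch.Size([2, 7])
```

Confining the spatial fusion to the action head, as in this sketch, is what the abstract credits with preserving the backbone's vision-language alignment while still supplying geometric priors for action prediction.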
Authors (13)
Zhengshen Zhang
Hao Li
Yalun Dai
Zhengbang Zhu
Lei Zhou
Chenchen Liu
+7 more
Submitted
October 20, 2025
Key Contributions
Introduces FALCON, a novel paradigm for Vision-Language-Action (VLA) models that injects rich 3D spatial tokens into the action head, leveraging spatial foundation models for strong geometric priors from RGB alone. It enhances spatial reasoning without compromising language understanding or requiring specialized sensors, addressing limitations in current VLA architectures.
Business Value
Enables more capable and adaptable robots for tasks requiring precise spatial understanding and interaction in real-world 3D environments, such as autonomous navigation, manipulation, and human-robot collaboration.