
VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation

📄 Abstract

A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enabled this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space that the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. Website: https://vamos-vla.github.io/
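
The propose-then-rerank interface described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the class and method names (GeneralistPlanner.propose, AffordanceModel.score, select_path, min_score) are hypothetical stand-ins for the learned components, not the paper's actual implementation.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class CandidatePath:
        # Candidate path proposed by the high-level planner, expressed as
        # pixel waypoints (N x 2 array of (u, v)) in the current camera image.
        waypoints_px: np.ndarray

    class GeneralistPlanner:
        # High-level VLA planner trained on diverse, open-world data; proposes
        # candidate paths directly in image space.
        def propose(self, image, instruction, k=8):
            raise NotImplementedError  # stands in for the learned planner

    class AffordanceModel:
        # Embodiment-specific specialist trained in low-cost simulation; scores
        # how traversable a candidate path is for this particular robot body.
        def score(self, image, path):
            raise NotImplementedError  # stands in for the learned affordance model

    def select_path(planner, affordance, image, instruction, min_score=0.5):
        # Propose candidate paths, re-rank them with the embodiment-specific
        # affordance model, and reject plans that look physically infeasible.
        candidates = planner.propose(image, instruction)
        scored = sorted(((affordance.score(image, p), p) for p in candidates),
                        key=lambda sp: sp[0], reverse=True)
        best_score, best_path = scored[0]
        if best_score < min_score:
            return None  # no feasible plan for this embodiment; replan or stop
        return best_path

The re-ranking step is what lets the same planner reject a staircase path for a rover while accepting it for a quadruped, which is the mechanism behind the reported 3X improvement in single-robot reliability.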
Authors (12)
Mateo Guaman Castro
Sidharth Rajagopal
Daniel Gorbatov
Matt Schmittle
Rohan Baijal
Octi Zhang
+6 more
Submitted
October 23, 2025
arXiv Category
cs.RO

Key Contributions

VAMOS introduces a hierarchical Vision-Language-Action (VLA) model that decouples semantic planning from embodiment grounding. This separation lets a generalist planner learn from diverse, open-world data while a specialist affordance model captures the physical constraints and capabilities of a specific robot, improving generalization across environments and enabling a single planner to serve physically distinct embodiments.
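
The cross-embodiment claim amounts to reusing one shared planner and swapping only the specialist model per robot. The sketch below illustrates that binding under the same hypothetical interfaces as above; names such as make_navigator, quadruped_affordance, and rover_affordance are illustrative, not from the paper's codebase.

    from typing import Callable, Protocol

    class Affordance(Protocol):
        # Minimal interface an embodiment-specific specialist must expose.
        def score(self, image, path) -> float: ...

    def make_navigator(planner, affordance: Affordance) -> Callable:
        # Bind the shared generalist planner to one embodiment-specific
        # affordance model, yielding a navigation policy for that robot.
        def navigate(image, instruction):
            candidates = planner.propose(image, instruction)
            return max(candidates, key=lambda p: affordance.score(image, p))
        return navigate

    # The same high-level planner is reused across physically distinct robots;
    # only the simulation-trained affordance model changes per embodiment:
    #   quadruped_nav = make_navigator(planner, quadruped_affordance)
    #   rover_nav     = make_navigator(planner, rover_affordance)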

Business Value

Enables more adaptable and versatile robots that can navigate complex and varied environments, reducing the need for task-specific retraining and increasing operational efficiency in logistics, exploration, and service industries.