
VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation

📄 Abstract

A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enabled this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space that the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. Website: https://vamos-vla.github.io/
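
The propose-then-rerank interface described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the class and method names (GeneralistPlanner.propose, AffordanceModel.score, select_path, min_score) are hypothetical stand-ins for the learned components, not the paper's actual implementation.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class CandidatePath:
        # Candidate path proposed by the high-level planner, expressed as
        # pixel waypoints (N x 2 array of (u, v)) in the current camera image.
        waypoints_px: np.ndarray

    class GeneralistPlanner:
        # High-level VLA planner trained on diverse, open-world data; proposes
        # candidate paths directly in image space.
        def propose(self, image, instruction, k=8):
            raise NotImplementedError  # stands in for the learned planner

    class AffordanceModel:
        # Embodiment-specific specialist trained in low-cost simulation; scores
        # how traversable a candidate path is for this particular robot body.
        def score(self, image, path):
            raise NotImplementedError  # stands in for the learned affordance model

    def select_path(planner, affordance, image, instruction, min_score=0.5):
        # Propose candidate paths, re-rank them with the embodiment-specific
        # affordance model, and reject plans that look physically infeasible.
        candidates = planner.propose(image, instruction)
        scored = sorted(((affordance.score(image, p), p) for p in candidates),
                        key=lambda sp: sp[0], reverse=True)
        best_score, best_path = scored[0]
        if best_score < min_score:
            return None  # no feasible plan for this embodiment; replan or stop
        return best_path

The re-ranking step is what lets the same planner reject a staircase path for a rover while accepting it for a quadruped, which is the mechanism behind the reported 3X improvement in single-robot reliability.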
Authors (12)
Mateo Guaman Castro
Sidharth Rajagopal
Daniel Gorbatov
Matt Schmittle
Rohan Baijal
Octi Zhang
+6 more
Submitted
October 23, 2025
arXiv Category
cs.RO

Key Contributions

VAMOS introduces a hierarchical Vision-Language-Action (VLA) model that decouples semantic planning from embodiment grounding. This separation lets a generalist planner learn from diverse, open-world data while a specialist affordance model captures the physical constraints and capabilities of a specific robot, improving generalization across environments and enabling a single planner to serve physically distinct embodiments.
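
The cross-embodiment claim amounts to reusing one shared planner and swapping only the specialist model per robot. The sketch below illustrates that binding under the same hypothetical interfaces as above; names such as make_navigator, quadruped_affordance, and rover_affordance are illustrative, not from the paper's codebase.

    from typing import Callable, Protocol

    class Affordance(Protocol):
        # Minimal interface an embodiment-specific specialist must expose.
        def score(self, image, path) -> float: ...

    def make_navigator(planner, affordance: Affordance) -> Callable:
        # Bind the shared generalist planner to one embodiment-specific
        # affordance model, yielding a navigation policy for that robot.
        def navigate(image, instruction):
            candidates = planner.propose(image, instruction)
            return max(candidates, key=lambda p: affordance.score(image, p))
        return navigate

    # The same high-level planner is reused across physically distinct robots;
    # only the simulation-trained affordance model changes per embodiment:
    #   quadruped_nav = make_navigator(planner, quadruped_affordance)
    #   rover_nav     = make_navigator(planner, rover_affordance)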

Business Value

Enables more adaptable and versatile robots that can navigate complex and varied environments, reducing the need for task-specific retraining and increasing operational efficiency in logistics, exploration, and service industries.