Abstract
We introduce InternVLA-M1, a unified framework for spatial grounding and
robot control that advances instruction-following robots toward scalable,
general-purpose intelligence. Its core idea is spatially guided
vision-language-action training, where spatial grounding serves as the critical
link between instructions and robot actions. InternVLA-M1 employs a two-stage
pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning
examples to determine "where to act" by aligning instructions with visual,
embodiment-agnostic positions, and (ii) spatially guided action post-training
to decide "how to act" by generating embodiment-aware actions through
plug-and-play spatial prompting. This spatially guided training recipe yields
consistent gains: InternVLA-M1 outperforms its variant without spatial guidance
by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO
Franka, while demonstrating stronger spatial reasoning capability in box,
point, and trace prediction. To further scale instruction following, we built a
simulation engine to collect 244K generalizable pick-and-place episodes,
enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In
real-world cluttered pick-and-place, InternVLA-M1 improved by 7.3%, and with
synthetic co-training, achieved +20.6% on unseen objects and novel
configurations. Moreover, in long-horizon, reasoning-intensive scenarios, it
surpassed existing works by over 10%. These results highlight spatially guided
training as a unifying principle for scalable and resilient generalist robots.
Code and models are available at
https://github.com/InternRobotics/InternVLA-M1.
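
As a rough illustration of the two-stage recipe, the sketch below shows how a spatial grounding model and an action policy might be composed: the grounder turns an instruction and image into embodiment-agnostic spatial prompts ("where to act"), and the action head conditions on those prompts to produce embodiment-aware actions ("how to act"). All class and method names here are hypothetical stand-ins, not the actual InternVLA-M1 API; see the repository above for the real implementation.

```python
# Hypothetical sketch of the spatially guided VLA pipeline; these names
# do NOT come from the InternVLA-M1 codebase.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class SpatialPrompt:
    """Embodiment-agnostic grounding output ("where to act")."""
    boxes: List[Tuple[float, float, float, float]]  # (x1, y1, x2, y2), normalized image coords
    points: List[Tuple[float, float]]               # e.g., candidate grasp points


class GroundingModel:
    """Stage (i) stand-in: a VLM pre-trained on spatial reasoning data."""

    def ground(self, image: np.ndarray, instruction: str) -> SpatialPrompt:
        # A real model would predict boxes/points/traces from the image and
        # instruction; this placeholder returns a fixed prompt.
        return SpatialPrompt(boxes=[(0.4, 0.3, 0.6, 0.5)], points=[(0.5, 0.4)])


class ActionPolicy:
    """Stage (ii) stand-in: embodiment-aware action head ("how to act")."""

    def act(self, image: np.ndarray, instruction: str,
            prompt: SpatialPrompt) -> np.ndarray:
        # Conditions on the plug-and-play spatial prompt; a real head would
        # decode a chunk of robot actions. Here: a dummy 7-DoF command.
        return np.zeros(7)


def control_step(image: np.ndarray, instruction: str,
                 grounder: GroundingModel, policy: ActionPolicy) -> np.ndarray:
    prompt = grounder.ground(image, instruction)    # where to act
    return policy.act(image, instruction, prompt)   # how to act


if __name__ == "__main__":
    frame = np.zeros((224, 224, 3), dtype=np.uint8)  # dummy camera frame
    action = control_step(frame, "pick up the red mug",
                          GroundingModel(), ActionPolicy())
    print(action.shape)  # (7,)
```

Because the spatial prompt lives in image coordinates rather than any robot's joint space, the same grounding output can in principle condition different embodiments' action heads, which is what makes the prompting plug-and-play.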
Authors (29)
Xinyi Chen
Yilun Chen
Yanwei Fu
Ning Gao
Jiaya Jia
Weiyang Jin
and 23 more
Submitted
October 15, 2025
Key Contributions
Introduces InternVLA-M1, a unified framework for spatial grounding and robot control that enables generalist robot policies. It uses a two-stage pipeline: spatial grounding pre-training to align instructions with visual positions, followed by spatially guided action post-training to generate embodiment-aware actions; together these significantly improve instruction-following performance.
Business Value
Enables the development of more versatile and intelligent robots capable of understanding and executing complex tasks based on natural language, accelerating automation in various industries.