Abstract
Solving complex real-world control tasks often takes multiple tries: if we
fail at first, we reflect on what went wrong, and change our strategy
accordingly to avoid making the same mistake. In robotics,
Vision-Language-Action models (VLAs) offer a promising path towards solving
complex control tasks, but lack the ability to contextually and dynamically
readjust behavior when they fail to accomplish a task. In this work, we
introduce Learning from Inference-Time Execution (LITEN), which connects a VLA
low-level policy to a high-level VLM that conditions on past experiences by
including them in-context, allowing it to learn the affordances and
capabilities of the low-level VLA. Our approach iterates between a reasoning
phase that generates and executes plans for the low-level VLA, and an
assessment phase that reflects on the resulting execution and draws useful
conclusions to be included in future reasoning contexts. Unlike similar
approaches to self-refinement in non-robotics domains, LITEN must reflect on
unstructured real-world robot trajectories (e.g., raw videos), which requires
structured guiderails during assessment. Our experimental results demonstrate
LITEN is able to effectively learn from past experience to generate plans that
use high-affordance instructions to accomplish long-horizon tasks.
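To make the iterative structure concrete, the following is a minimal sketch of the reasoning/assessment loop described above. All interface names (vlm_propose_plan, vla_execute, vlm_assess) are hypothetical stand-ins for illustration, not the authors' implementation or API.

```python
# Illustrative sketch of an inference-time reflect-and-retry loop in the style
# described by the abstract. All functions below are hypothetical placeholders.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Assessment:
    success: bool
    conclusions: str  # distilled lessons about the low-level VLA's affordances


def vlm_propose_plan(task: str, context: List[str]) -> List[str]:
    """Reasoning phase: a high-level VLM proposes low-level instructions,
    conditioning in-context on conclusions drawn from past executions."""
    raise NotImplementedError  # placeholder for a VLM call


def vla_execute(instruction: str):
    """Execute one instruction with the low-level VLA policy; returns raw video."""
    raise NotImplementedError  # placeholder for robot execution


def vlm_assess(task: str, plan: List[str], videos) -> Assessment:
    """Assessment phase: reflect on the raw execution videos (with structured
    guiderails) and distill conclusions for future reasoning contexts."""
    raise NotImplementedError  # placeholder for a VLM call


def liten_episode(task: str, memory: List[str], max_iters: int = 3) -> Tuple[bool, List[str]]:
    """Iterate between reasoning (plan + execute) and assessment phases."""
    for _ in range(max_iters):
        plan = vlm_propose_plan(task, context=memory)    # reasoning phase
        videos = [vla_execute(step) for step in plan]    # execution by the low-level VLA
        result = vlm_assess(task, plan, videos)          # assessment phase
        memory.append(result.conclusions)                # carried into future contexts
        if result.success:
            return True, memory
    return False, memory
```

The key design point reflected here is that learning happens purely in-context: the loop never updates model weights, only the memory of conclusions that conditions future plans.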
Authors (6)
Ameesh Shah
William Chen
Adwait Godbole
Federico Mora
Sanjit A. Seshia
Sergey Levine
Submitted
October 22, 2025
Key Contributions
This paper introduces LITEN (Learning from Inference-Time Execution), an approach that pairs a high-level VLM planner with a low-level Vision-Language-Action (VLA) policy, enabling the system to learn the low-level policy's affordances and readjust its behavior after failures. It iterates between a reasoning phase (plan generation and execution) and an assessment phase (reflection on the resulting execution), keeping conclusions from past experience in-context to improve future reasoning and task success.
Business Value
Enables robots to become more robust, adaptable, and efficient at complex real-world tasks, reducing the need for extensive retraining and manual intervention when failures occur.