Abstract
Solving complex real-world control tasks often takes multiple tries: if we
fail at first, we reflect on what went wrong, and change our strategy
accordingly to avoid making the same mistake. In robotics,
Vision-Language-Action models (VLAs) offer a promising path towards solving
complex control tasks, but lack the ability to contextually and dynamically
readjust behavior when they fail to accomplish a task. In this work, we
introduce Learning from Inference-Time Execution (LITEN), which connects a VLA
low-level policy to a high-level VLM that conditions on past experiences by
including them in-context, allowing it to learn the affordances and
capabilities of the low-level VLA. Our approach iterates between a reasoning
phase that generates and executes plans for the low-level VLA, and an
assessment phase that reflects on the resulting execution and draws useful
conclusions to be included in future reasoning contexts. Unlike similar
approaches to self-refinement in non-robotics domains, LITEN must reflect on
unstructured real-world robot trajectories (e.g., raw videos), which requires
structured guiderails during assessment. Our experimental results demonstrate
LITEN is able to effectively learn from past experience to generate plans that
use high-affordance instructions to accomplish long-horizon tasks.
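To make the iterative structure concrete, the following is a minimal sketch of the reasoning/assessment loop described above. All interface names (vlm_propose_plan, vla_execute, vlm_assess) are hypothetical stand-ins for illustration, not the authors' implementation or API.

```python
# Illustrative sketch of an inference-time reflect-and-retry loop in the style
# described by the abstract. All functions below are hypothetical placeholders.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Assessment:
    success: bool
    conclusions: str  # distilled lessons about the low-level VLA's affordances


def vlm_propose_plan(task: str, context: List[str]) -> List[str]:
    """Reasoning phase: a high-level VLM proposes low-level instructions,
    conditioning in-context on conclusions drawn from past executions."""
    raise NotImplementedError  # placeholder for a VLM call


def vla_execute(instruction: str):
    """Execute one instruction with the low-level VLA policy; returns raw video."""
    raise NotImplementedError  # placeholder for robot execution


def vlm_assess(task: str, plan: List[str], videos) -> Assessment:
    """Assessment phase: reflect on the raw execution videos (with structured
    guiderails) and distill conclusions for future reasoning contexts."""
    raise NotImplementedError  # placeholder for a VLM call


def liten_episode(task: str, memory: List[str], max_iters: int = 3) -> Tuple[bool, List[str]]:
    """Iterate between reasoning (plan + execute) and assessment phases."""
    for _ in range(max_iters):
        plan = vlm_propose_plan(task, context=memory)    # reasoning phase
        videos = [vla_execute(step) for step in plan]    # execution by the low-level VLA
        result = vlm_assess(task, plan, videos)          # assessment phase
        memory.append(result.conclusions)                # carried into future contexts
        if result.success:
            return True, memory
    return False, memory
```

The key design point reflected here is that learning happens purely in-context: the loop never updates model weights, only the memory of conclusions that conditions future plans.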
Authors (6)
Ameesh Shah
William Chen
Adwait Godbole
Federico Mora
Sanjit A. Seshia
Sergey Levine
Submitted
October 22, 2025
Key Contributions
This paper introduces LITEN (Learning from Inference-Time Execution), an approach that pairs a high-level VLM planner with a low-level Vision-Language-Action (VLA) policy, enabling the system to learn the low-level policy's affordances and readjust its behavior after failures. It iterates between a reasoning phase (plan generation and execution) and an assessment phase (reflection on the resulting execution), keeping conclusions from past experience in-context to improve future reasoning and task success.
Business Value
Enables robots to become more robust, adaptable, and efficient at complex real-world tasks, reducing the need for extensive retraining and manual intervention when failures occur.