Abstract
Recent vision-language-action (VLA) models built on pretrained
vision-language models (VLMs) have demonstrated strong performance in robotic
manipulation. However, these models remain constrained by the single-frame
image paradigm and fail to fully leverage the temporal information offered by
multi-frame histories, as directly feeding multiple frames into VLM backbones
incurs substantial computational overhead and inference latency. We propose
CronusVLA, a unified framework that extends single-frame VLA models to the
multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame
pretraining on large-scale embodied datasets with autoregressive prediction of
action tokens, establishing an effective embodied vision-language foundation;
(2) Multi-frame post-training, which shifts the vision-language
backbone's prediction target from discrete action tokens to learnable
features and aggregates historical information via feature chunking.
CronusVLA effectively
addresses the existing challenges of multi-frame modeling while enhancing
performance and observational robustness. To evaluate the robustness under
temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel
benchmark featuring 24 types of observational disturbances and 120 severity
levels. Experiments across three embodiments in simulated and real-world
environments demonstrate that CronusVLA achieves leading performance and
superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8%
improvement over OpenVLA on LIBERO, and the highest robustness score on
SimplerEnv-OR. These results highlight the potential of efficient multi-frame
adaptation in VLA models for more powerful and robust real-world deployment.
Authors (11)
Hao Li
Shuai Yang
Yilun Chen
Xinyi Chen
Xiaoda Yang
Yang Tian
+5 more
Key Contributions
CronusVLA proposes a unified framework that efficiently extends single-frame VLA models to the multi-frame paradigm. It uses a two-stage training process: single-frame pretraining to establish an embodied vision-language foundation, followed by multi-frame post-training that aggregates historical information via feature chunking (sketched below), avoiding the computational overhead and inference latency of feeding multiple frames directly into the VLM backbone.
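A minimal sketch of how such feature chunking could aggregate cached per-frame features, assuming a PyTorch implementation; the module names, dimensions, and attention-based fusion below are illustrative assumptions, not the authors' released code:

```python
# Hypothetical sketch of multi-frame aggregation via feature chunking.
# Dimensions, the attention-based fusion, and all names are assumptions
# for illustration, not CronusVLA's actual implementation.
import torch
import torch.nn as nn

class FeatureChunkAggregator(nn.Module):
    """Keeps a rolling cache of per-frame features emitted by the VLM
    backbone and fuses the most recent K frames into one action vector,
    so past frames never need to be re-encoded by the backbone."""

    def __init__(self, feat_dim: int = 1024, history_len: int = 4,
                 action_dim: int = 7):
        super().__init__()
        self.history_len = history_len
        # Lightweight fusion over the frame axis; a single attention
        # layer keeps inference cost far below re-running the backbone.
        self.frame_attn = nn.MultiheadAttention(feat_dim, num_heads=8,
                                                batch_first=True)
        self.action_head = nn.Linear(feat_dim, action_dim)
        self._cache: list[torch.Tensor] = []  # past frame features

    def forward(self, frame_feat: torch.Tensor) -> torch.Tensor:
        # frame_feat: (B, feat_dim), the learnable feature the backbone
        # predicts in place of discrete action tokens after post-training.
        self._cache.append(frame_feat.detach())
        self._cache = self._cache[-self.history_len:]  # last K frames only
        chunk = torch.stack(self._cache, dim=1)   # (B, K, feat_dim)
        query = frame_feat.unsqueeze(1)           # current frame as query
        fused, _ = self.frame_attn(query, chunk, chunk)
        return self.action_head(fused.squeeze(1))  # (B, action_dim)

if __name__ == "__main__":
    agg = FeatureChunkAggregator()
    for _ in range(6):  # stream of per-frame backbone features
        action = agg(torch.randn(2, 1024))
    print(action.shape)  # torch.Size([2, 7])
```

The design point this illustrates: the expensive backbone runs once per new frame, while history is carried in cheap cached features, which is how a multi-frame model can stay close to single-frame latency.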
Business Value
Enables more capable and robust robotic systems for tasks requiring understanding of dynamic environments and temporal sequences, such as assembly, logistics, and service robotics.