CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

Abstract

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
Authors (11)
Hao Li
Shuai Yang
Yilun Chen
Xinyi Chen
Xiaoda Yang
Yang Tian
+5 more
Submitted
June 24, 2025
arXiv Category
cs.RO

Key Contributions

CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm efficiently. Training has two stages: single-frame pretraining, which builds an embodied vision-language foundation, and multi-frame post-training, which switches the backbone's output from discrete action tokens to learnable features and aggregates historical information via feature chunking. This avoids the computational overhead and inference latency of feeding multiple frames directly into the VLM backbone.
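
The efficiency argument behind feature chunking can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `FeatureChunkBuffer`, the function `toy_encoder`, the history length of 3, and the use of mean pooling as the aggregation step are all assumptions made for illustration, since this summary does not describe the actual architecture. The sketch shows only the general idea that each new frame passes through the expensive backbone once, while earlier frames are reused as cached features.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = List[float]


class FeatureChunkBuffer:
    """Fixed-length FIFO cache of per-frame features (hypothetical helper)."""

    def __init__(self, encode: Callable[[Sequence[float]], Feature], history: int) -> None:
        if history < 1:
            raise ValueError("history must be at least 1")
        self.encode = encode
        # deque(maxlen=...) silently drops the oldest feature once full.
        self.chunk: Deque[Feature] = deque(maxlen=history)
        self.encoder_calls = 0

    def step(self, frame: Sequence[float]) -> Feature:
        # Only the newest frame goes through the (expensive) backbone.
        self.chunk.append(self.encode(frame))
        self.encoder_calls += 1
        # Assumption: mean pooling stands in for the model's real aggregation.
        dim = len(self.chunk[0])
        return [sum(f[i] for f in self.chunk) / len(self.chunk) for i in range(dim)]


def toy_encoder(frame: Sequence[float]) -> Feature:
    # Stand-in for a VLM backbone: maps a "frame" to a 2-dimensional feature.
    return [float(sum(frame)), float(len(frame))]


buffer = FeatureChunkBuffer(toy_encoder, history=3)
outputs = [buffer.step(frame) for frame in ([1, 2], [3, 4], [5, 6], [7, 8])]
```

After four steps the encoder has run four times, once per frame, and the buffer holds only the three most recent features. Feeding the full history into the backbone at every step would instead re-encode every past frame each time.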

Business Value

Enables more capable and robust robotic systems for tasks requiring understanding of dynamic environments and temporal sequences, such as assembly, logistics, and service robotics.