Abstract
Recent vision-language-action (VLA) models built on pretrained
vision-language models (VLMs) have demonstrated strong performance in robotic
manipulation. However, these models remain constrained by the single-frame
image paradigm and fail to fully leverage the temporal information offered by
multi-frame histories, as directly feeding multiple frames into VLM backbones
incurs substantial computational overhead and inference latency. We propose
CronusVLA, a unified framework that extends single-frame VLA models to the
multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame
pretraining on large-scale embodied datasets with autoregressive prediction of
action tokens, establishing an effective embodied vision-language foundation;
(2) Multi-frame post-training, which shifts the vision-language
backbone's prediction target from discrete action tokens to learnable
features and aggregates historical information via feature chunking.
CronusVLA effectively
addresses the existing challenges of multi-frame modeling while enhancing
performance and observational robustness. To evaluate the robustness under
temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel
benchmark featuring 24 types of observational disturbances and 120 severity
levels. Experiments across three embodiments in simulated and real-world
environments demonstrate that CronusVLA achieves leading performance and
superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8%
improvement over OpenVLA on LIBERO, and the highest robustness score on
SimplerEnv-OR. These results highlight the potential of efficient multi-frame
adaptation in VLA models for more powerful and robust real-world deployment.
Authors (11)
Hao Li
Shuai Yang
Yilun Chen
Xinyi Chen
Xiaoda Yang
Yang Tian
+5 more
Key Contributions
CronusVLA proposes a unified framework that efficiently extends single-frame VLA models to the multi-frame paradigm. It uses a two-stage training process: single-frame pretraining to establish an embodied vision-language foundation, followed by multi-frame post-training that aggregates historical information via feature chunking (sketched below), avoiding the computational overhead and inference latency of feeding multiple frames directly into the VLM backbone.
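A minimal sketch of how such feature chunking could aggregate cached per-frame features, assuming a PyTorch implementation; the module names, dimensions, and attention-based fusion below are illustrative assumptions, not the authors' released code:

```python
# Hypothetical sketch of multi-frame aggregation via feature chunking.
# Dimensions, the attention-based fusion, and all names are assumptions
# for illustration, not CronusVLA's actual implementation.
import torch
import torch.nn as nn

class FeatureChunkAggregator(nn.Module):
    """Keeps a rolling cache of per-frame features emitted by the VLM
    backbone and fuses the most recent K frames into one action vector,
    so past frames never need to be re-encoded by the backbone."""

    def __init__(self, feat_dim: int = 1024, history_len: int = 4,
                 action_dim: int = 7):
        super().__init__()
        self.history_len = history_len
        # Lightweight fusion over the frame axis; a single attention
        # layer keeps inference cost far below re-running the backbone.
        self.frame_attn = nn.MultiheadAttention(feat_dim, num_heads=8,
                                                batch_first=True)
        self.action_head = nn.Linear(feat_dim, action_dim)
        self._cache: list[torch.Tensor] = []  # past frame features

    def forward(self, frame_feat: torch.Tensor) -> torch.Tensor:
        # frame_feat: (B, feat_dim), the learnable feature the backbone
        # predicts in place of discrete action tokens after post-training.
        self._cache.append(frame_feat.detach())
        self._cache = self._cache[-self.history_len:]  # last K frames only
        chunk = torch.stack(self._cache, dim=1)   # (B, K, feat_dim)
        query = frame_feat.unsqueeze(1)           # current frame as query
        fused, _ = self.frame_attn(query, chunk, chunk)
        return self.action_head(fused.squeeze(1))  # (B, action_dim)

if __name__ == "__main__":
    agg = FeatureChunkAggregator()
    for _ in range(6):  # stream of per-frame backbone features
        action = agg(torch.randn(2, 1024))
    print(action.shape)  # torch.Size([2, 7])
```

The design point this illustrates: the expensive backbone runs once per new frame, while history is carried in cheap cached features, which is how a multi-frame model can stay close to single-frame latency.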
Business Value
Enables more capable and robust robotic systems for tasks requiring understanding of dynamic environments and temporal sequences, such as assembly, logistics, and service robotics.