Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

📄 Abstract

This paper presents a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic VLA training data in terms of task granularity and labels. This is achieved through a fully-automated, holistic human activity analysis approach for arbitrary human hand videos, which generates atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the appealing scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.
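The abstract describes each training episode as an atomic hand-activity segment paired with a language description and framewise 3D hand motion and camera motion. A minimal sketch of what such an episode record might look like is shown below; the `HandVLAEpisode` container, all field names, and the array shapes are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HandVLAEpisode:
    """Hypothetical record for one atomic hand-activity segment.

    Field names and shapes are assumptions for illustration; the paper only
    states that each segment carries a language description plus framewise
    3D hand motion and camera motion.
    """
    instruction: str          # language description of the atomic task
    rgb_frames: np.ndarray    # (T, H, W, 3) egocentric video frames
    hand_pose_3d: np.ndarray  # (T, J, 3) framewise 3D hand joint positions
    camera_pose: np.ndarray   # (T, 4, 4) framewise camera extrinsics

    def __post_init__(self) -> None:
        # Basic consistency check: all framewise arrays share the time axis.
        t = self.rgb_frames.shape[0]
        assert self.hand_pose_3d.shape[0] == t
        assert self.camera_pose.shape[0] == t


# Example: a 2-second, 30 fps segment with 21 hand joints (illustrative numbers).
episode = HandVLAEpisode(
    instruction="pick up the mug from the table",
    rgb_frames=np.zeros((60, 224, 224, 3), dtype=np.uint8),
    hand_pose_3d=np.zeros((60, 21, 3), dtype=np.float32),
    camera_pose=np.tile(np.eye(4, dtype=np.float32), (60, 1, 1)),
)
```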
Authors (17)
Qixiu Li
Yu Deng
Yaobo Liang
Lin Luo
Lei Zhou
Chengtang Yao
+11 more
Submitted
October 24, 2025
arXiv Category
cs.RO

Key Contributions

This paper presents a novel approach for pretraining robotic manipulation VLA models using large-scale, unscripted real-life human activity videos. It develops a fully automated analysis method that transforms egocentric human videos into data formats aligned with existing robotic VLA training data, yielding a large dataset (1M episodes, 26M frames) covering diverse objects, manipulation tasks, and environments. A dexterous hand VLA model pretrained on this dataset shows strong zero-shot capabilities on unseen real-world observations, and fine-tuning on a small amount of real robot action data improves task success rates and generalization to novel objects.

Business Value

Accelerates the development of more capable and versatile robotic manipulation systems by providing a scalable and effective pretraining strategy using real-world human data.