Abstract
This paper presents a novel approach for pretraining robotic manipulation
Vision-Language-Action (VLA) models using a large corpus of unscripted
real-life video recordings of human hand activities. Treating the human hand as
a dexterous robot end-effector, we show that "in-the-wild" egocentric human
videos without any annotations can be transformed into data formats fully
aligned with existing robotic VLA training data in terms of task granularity
and labels. This is achieved through a fully automated, holistic human activity
analysis approach for arbitrary human hand videos, which generates atomic-level
hand activity segments and their language descriptions, each accompanied by
framewise 3D hand motion and camera motion.
We process a large volume of egocentric videos and create a hand-VLA training
dataset containing 1M episodes and 26M frames. This training data covers a wide
range of objects and concepts, dexterous manipulation tasks, and environment
variations in real life, vastly exceeding the coverage of existing robot data.
We design a dexterous hand VLA model architecture and pretrain the model on
this dataset. The model exhibits strong zero-shot capabilities on completely
unseen real-world observations. Additionally, fine-tuning it on a small amount
of real robot action data significantly improves task success rates and
generalization to novel objects in real robotic experiments. We also
demonstrate favorable scaling of the model's task performance with respect to
pretraining data scale. We believe this work lays a solid foundation
for scalable VLA pretraining, advancing robots toward truly generalizable
embodied intelligence.
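To make the described data format concrete, the sketch below shows one plausible way an atomic hand-activity episode could be represented: a language description plus framewise 3D hand motion and camera motion aligned to the video frames. The class name, field names, and array shapes are assumptions for illustration only, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class HandVLAEpisode:
    """Hypothetical record for one atomic hand-activity segment (assumed schema)."""
    instruction: str        # language description of the atomic activity
    rgb_frames: List[str]   # paths to the T egocentric video frames of the segment
    hand_pose: np.ndarray   # (T, D) framewise 3D hand motion, e.g. keypoints or MANO parameters
    camera_pose: np.ndarray # (T, 4, 4) framewise camera-to-world extrinsics

    def __post_init__(self) -> None:
        t = len(self.rgb_frames)
        # Hand and camera motion are framewise, so they must align with the video length.
        assert self.hand_pose.shape[0] == t, "hand motion must be framewise-aligned"
        assert self.camera_pose.shape == (t, 4, 4), "camera motion must be framewise-aligned"
```

Under this assumed layout, each episode is self-contained and matches robot VLA data in granularity (one instruction per atomic segment) and labels (language plus per-frame action-relevant motion).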
Authors (17)
Qixiu Li
Yu Deng
Yaobo Liang
Lin Luo
Lei Zhou
Chengtang Yao
+11 more
Submitted
October 24, 2025
Key Contributions
This paper presents a novel approach for pretraining robotic manipulation VLA models using large-scale, unscripted real-life human activity videos. It develops an automated analysis method to transform egocentric human videos into data formats aligned with robotic VLA training, creating a large dataset (1M episodes, 26M frames) covering diverse manipulation tasks.
Business Value
Accelerates the development of more capable and versatile robotic manipulation systems by providing a scalable and effective pretraining strategy using real-world human data.