GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Abstract

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

Authors (27)

GigaBrain Team

Angen Ye

Boyuan Wang

Chaojun Ni

Guan Huang

Guosheng Zhao

+21 more

Submitted

October 22, 2025

arXiv Category

cs.RO

Key Contributions

GigaBrain-0 is a VLA foundation model that reduces reliance on expensive real-world robot data by training on diverse data generated by world models. It improves cross-task generalization, strengthens policy robustness through RGBD input modeling, and adds reasoning capabilities via embodied Chain-of-Thought (CoT) supervision, with the goal of more capable and data-efficient generalist robots.

Business Value

Could accelerate the development and deployment of more capable, versatile robots by easing the data collection bottleneck, which may support wider adoption across industries.

Paper Metadata

Innovation Type

Data Generation Strategy

Deployment Feasibility

Moderate. Requires significant computational resources for training and inference, and integration with robotic hardware. The use of generated data might introduce sim-to-real gaps.

Limitations Addressed

The prohibitive cost and time required for collecting large-scale real-world robot data, which limits the scalability and generalization of current VLA systems.

Technical Tags

Vision-Language-Action (VLA), foundation model, world models, generated data, cross-task generalization, RGBD input, embodied Chain-of-Thought (CoT), robot data efficiency

Research Topics

Robotics, AI Foundation Models, Reinforcement Learning, Embodied AI, Data Augmentation

Methods & Architectures

world model-generated data, RGBD input modeling, embodied Chain-of-Thought (CoT) supervision, foundation model training, Vision-Language-Action (VLA) model, World Models

Applications & Tasks

Generalist Robots, Humanoid Robots, Service Robots, Industrial Automation, High cost of real-world robot data collection, Limited scalability and generalization of current VLA systems, Inefficiency of physical data collection, Generalist Robot Control, Vision-Language-Action tasks, Task Generalization

Related Fields

Robotics, Machine Learning, AI Foundation Models, Reinforcement Learning, Computer Vision

Keywords

VLA, foundation model, world models, generated data, robotics, embodied AI, generalization, data efficiency, Chain-of-Thought, RGBD

Academic Context

#Robotics, #AI Foundation Models, #Reinforcement Learning, #Embodied AI, #Data Augmentation

Commercial Potential

Potential Products

General-purpose robot control software, AI platforms for robot development

Target Industries

Robotics, Manufacturing, Logistics, Healthcare, Consumer Electronics

Use Case Examples

Robots performing complex tasks in homes or factories, Autonomous agents in simulated environments for training, Robots assisting humans with diverse tasks

Competitive Edge

Addresses the data bottleneck in VLA model training more effectively than methods relying solely on real-world data.

Resource Requirements

Compute Needs

Very high compute for training foundation models, significant compute for inference.

Data Requirements

Leverages world model-generated data, reducing reliance on real robot data, but still requires initial real data for world model training.

Deployment Constraints

Computational resources, Real-time control requirements, Hardware integration

Scalability

Designed for scalability through data generation, but model size and inference speed remain limiting factors.

Production Readiness

Maturity Level

Research
