
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model


Abstract

Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
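The asynchronous joint sampling mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: it assumes plain Euler integration from noise (t = 0) to data (t = 1), and it assumes the vision stream takes more, smaller steps than the action stream. The names `async_joint_sample`, `velocity_fn`, `action_steps`, and `obs_per_action` are hypothetical, and `toy_velocity` stands in for a trained dual-stream model.

```python
def async_joint_sample(velocity_fn, action, obs, action_steps=4, obs_per_action=2):
    """Euler-integrate two streams from t=0 to t=1 at different rates.

    Assumption (not from the paper): the vision stream takes
    `obs_per_action` small steps for every single action step.
    """
    total_steps = action_steps * obs_per_action
    dt_obs = 1.0 / total_steps       # small step for vision tokens
    dt_act = 1.0 / action_steps      # larger step for action tokens
    t_act, t_obs = 0.0, 0.0
    for step in range(total_steps):
        # One joint forward pass; each stream is conditioned on its own timestep.
        v_act, v_obs = velocity_fn(action, t_act, obs, t_obs)
        # Vision tokens move on every step.
        obs = [x + dt_obs * v for x, v in zip(obs, v_obs)]
        t_obs += dt_obs
        # Action tokens move only once per group of vision steps.
        if (step + 1) % obs_per_action == 0:
            action = [x + dt_act * v for x, v in zip(action, v_act)]
            t_act += dt_act
    return action, obs


def toy_velocity(action, t_act, obs, t_obs):
    """Hypothetical stand-in for a trained model: constant velocity fields."""
    return [1.0 for _ in action], [2.0 for _ in obs]


final_action, final_obs = async_joint_sample(toy_velocity, [0.0], [0.0])
print(final_action, final_obs)
```

Under these assumptions, raising `obs_per_action` spends more model evaluations at inference time without changing the number of action updates, which is one way to read "test-time scaling" here.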

Authors (5)

John Won

Kyungmin Lee

Huiwon Jang

Dongyoung Kim

Jinwoo Shin

Submitted

October 31, 2025

arXiv Category

cs.CV


Key Contributions

DUST is a novel world-model augmented VLA framework that uses a dual-stream diffusion architecture to jointly predict next-state observations and action sequences. It addresses modality conflicts by maintaining separate streams with independent noise perturbations and a decoupled flow-matching loss, enabling bidirectional learning without a unified latent space.
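As a rough illustration of independent noise perturbations and a decoupled flow-matching loss, here is a minimal pure-Python sketch. It assumes a linear interpolation path from noise (t = 0) to data (t = 1), whose velocity target is `data - noise`, and a simple weighted sum of the two per-stream losses; the paper's exact schedule and weighting may differ. All names (`decoupled_fm_loss`, `obs_weight`, the `model` callable) are hypothetical.

```python
import random


def interpolate(data, noise, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for d, n in zip(data, noise)]


def velocity_target(data, noise):
    """Velocity of the linear path; it is constant in t."""
    return [d - n for d, n in zip(data, noise)]


def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def decoupled_fm_loss(model, action, obs, rng, obs_weight=1.0):
    """One training sample of a two-stream flow-matching loss (sketch).

    `model(noisy_action, t_act, noisy_obs, t_obs)` is assumed to return
    a predicted velocity for each stream.
    """
    # Independent timesteps and noise per modality: one stream can be
    # nearly clean while the other is nearly pure noise.
    t_act, t_obs = rng.random(), rng.random()
    noise_act = [rng.gauss(0.0, 1.0) for _ in action]
    noise_obs = [rng.gauss(0.0, 1.0) for _ in obs]

    noisy_act = interpolate(action, noise_act, t_act)
    noisy_obs = interpolate(obs, noise_obs, t_obs)

    v_act, v_obs = model(noisy_act, t_act, noisy_obs, t_obs)

    # Each stream is scored only against its own velocity target.
    loss_act = mse(v_act, velocity_target(action, noise_act))
    loss_obs = mse(v_obs, velocity_target(obs, noise_obs))
    return loss_act + obs_weight * loss_obs, loss_act, loss_obs


def zero_model(noisy_act, t_act, noisy_obs, t_obs):
    """Hypothetical untrained model that predicts zero velocity."""
    return [0.0] * len(noisy_act), [0.0] * len(noisy_obs)


total, loss_act, loss_obs = decoupled_fm_loss(
    zero_model, [0.5, -1.0], [1.0, 2.0, 3.0], random.Random(0)
)
print(total, loss_act, loss_obs)
```

Because the two timesteps are drawn independently, the same model sees cases where actions are clean and observations are noisy and vice versa, which is one way to interpret the bidirectional learning described above.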

Business Value

Enables more capable and adaptable robots that can better understand and interact with their environment, leading to advancements in automation and human-robot collaboration.

Paper Metadata

Innovation Type

Architectural Innovation

Deployment Feasibility

Moderate. Training and inference require significant computational resources, and the model must be integrated into robotic control systems.

Limitations Addressed

Difficulty in jointly predicting next-state observations and action sequences, inherent difference between vision and action modalities, modality conflict in VLAs, need for a unified latent space

Performance Gains

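Up to 6% over baseline methods on simulated benchmarks (RoboCasa, GR-1), an additional 2-5% from test-time scaling, and a 13% improvement in success rate on real-world Franka Research 3 tasks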

Technical Tags

diffusion models, vision-language-action (VLA), world modeling, robotic policy learning, multimodal transformer, modality streams, noise perturbations, flow matching

Research Topics

Generative Models, Robotics, Vision-Language Models, World Modeling, Reinforcement Learning

Methods & Architectures

DUal-STream diffusion (DUST), Multimodal Diffusion Transformer, Independent Noise Perturbations, Decoupled Flow-Matching Loss, Diffusion Transformer, Multimodal Transformer

Applications & Tasks

Robotics, Embodied AI, Human-Robot Interaction, Joint Prediction of Next-State and Actions, Modality Conflict in VLAs, Robotic Policy Learning, World Model Augmentation, Vision-Language-Action Modeling, World Model Prediction

Related Fields

Deep Learning, Generative Models, Robotics Control, Reinforcement Learning

Keywords

diffusion models, VLA, world model, robotics, policy learning, multimodal, transformer, flow matching, generative AI, embodied AI

Academic Context

#Generative Models, #Robotics, #Vision-Language Models, #World Modeling, #Reinforcement Learning

Commercial Potential

Potential Products

Robotic control software, simulation environments for robot training, AI assistants for complex tasks

Target Industries

Robotics, Manufacturing, Logistics, Healthcare

Use Case Examples

Robots learning complex manipulation tasks, autonomous agents navigating and interacting in simulated environments, personalized robotic assistance

Competitive Edge

Provides a novel approach to integrating world models with VLA models using diffusion, potentially outperforming existing methods that struggle with modality differences.

Market Opportunity

Growing, driven by advancements in robotics and AI.

Revenue Models

Licensing of the technology, development of specialized robotic systems.

Resource Requirements

Compute Needs

High. Training diffusion models requires significant GPU resources.

Data Requirements

Large-scale datasets of robot interactions, visual observations, and actions.

Deployment Constraints

Real-time performance requirements for robotic control, computational cost.

Scalability

Scalability depends on the efficiency of the diffusion model and transformer architecture.

Production Readiness

Maturity Level

Research

Time to Market

3-5 years for robust deployment in real-world robotics.

Patent Potential

Moderate, for the dual-stream diffusion architecture.
