Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin
2025-11-03
Summary
This paper introduces a new way to help robots learn tasks by combining what they 'see' (vision), the instructions they are given (language), and the actions they take. It focuses on improving how robots predict what will happen next in the world and what actions they should perform, using a technique called world modeling.
What's the problem?
Robots often struggle when trying to predict both what will happen next visually *and* what actions to take to make it happen. These are different types of information – images versus motor commands – and it's hard for a robot to learn how they relate to each other. Existing methods try to force these different types of data into a single, unified representation, which can create conflicts between the modalities and limit performance.
What's the solution?
The researchers developed a system called DUST, which stands for DUal-STream diffusion. Instead of forcing everything together, DUST keeps the visual and action information in separate 'streams' while still allowing them to learn from each other. They use a type of neural network called a diffusion transformer together with a training method (decoupled flow matching) that adds noise to each stream independently. This lets the robot learn how things change over time in a more natural way, and enables more flexible planning during task execution by letting the vision and action parts of the prediction evolve at different speeds.
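The "adds noise to each stream independently" idea can be sketched as a flow-matching training step in which the vision and action streams each get their own noise level and their own velocity-regression target. This is a minimal NumPy sketch, not the paper's implementation: the `model` callable, the linear interpolation path, and the simple loss sum are all assumptions made for illustration.

```python
import numpy as np

def dust_training_step(vision, action, model, rng):
    """One decoupled flow-matching step (illustrative sketch).

    `model` is a hypothetical dual-stream network that takes both noised
    streams plus their noise levels and returns a velocity prediction
    for each stream.
    """
    # Independent noise levels for each modality stream.
    t_v, t_a = rng.uniform(size=2)
    eps_v = rng.standard_normal(vision.shape)
    eps_a = rng.standard_normal(action.shape)
    # Linear interpolation paths (rectified-flow convention: t=0 is data, t=1 is noise).
    x_v = (1 - t_v) * vision + t_v * eps_v
    x_a = (1 - t_a) * action + t_a * eps_a
    v_pred, a_pred = model(x_v, x_a, t_v, t_a)
    # Decoupled loss: each stream regresses its own velocity target,
    # so neither modality is forced through a shared latent.
    loss_v = np.mean((v_pred - (eps_v - vision)) ** 2)
    loss_a = np.mean((a_pred - (eps_a - action)) ** 2)
    return loss_v + loss_a
```

Because the two noise levels are drawn independently, the model sees every combination of "clean vision, noisy action" and vice versa during training, which is what later allows the two streams to be denoised at different rates.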
Why it matters?
This work is important because it significantly improves a robot’s ability to learn and perform complex tasks, both in simulated environments and in the real world. The improvements in success rates, especially with a real robot arm, demonstrate the practical value of this approach. Furthermore, the ability to pre-train the system using videos without actions suggests that robots could learn a lot about the world just by watching, which could lead to even more capable and adaptable robots in the future.
Abstract
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
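The abstract's joint sampling method, where "action and vision tokens evolve asynchronously at different rates," can be illustrated with a simple Euler integration of the learned flow in which the action stream takes a denoising step on every iteration while the vision stream only updates on a coarser schedule. This is a hedged sketch under assumed conventions (the same hypothetical `model` interface and rectified-flow path as above); the paper's actual schedule may differ.

```python
import numpy as np

def asynchronous_sample(model, v_shape, a_shape,
                        n_action_steps=20, vision_stride=4, rng=None):
    """Joint sampling sketch: actions denoise every step, vision every
    `vision_stride`-th step. Increasing `n_action_steps` is one way to
    spend more test-time compute on the action prediction."""
    if rng is None:
        rng = np.random.default_rng()
    # Both streams start from pure noise at t = 1.
    x_v = rng.standard_normal(v_shape)
    x_a = rng.standard_normal(a_shape)
    t_v = t_a = 1.0
    dt_a = 1.0 / n_action_steps
    dt_v = vision_stride / n_action_steps
    for step in range(n_action_steps):
        v_vel, a_vel = model(x_v, x_a, t_v, t_a)
        # Action stream: fine-grained Euler step toward t = 0.
        x_a = x_a - dt_a * a_vel
        t_a -= dt_a
        # Vision stream: coarser schedule, one larger step per stride.
        if step % vision_stride == vision_stride - 1:
            x_v = x_v - dt_v * v_vel
            t_v -= dt_v
    return x_v, x_a
```

The asymmetry reflects the two streams' roles: the action sequence is what the robot executes, so it gets a fine-grained schedule, while the predicted next observation only needs enough refinement to keep guiding the actions.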