Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
2025-12-30
Summary
This paper introduces new AI models, Dream-VL and Dream-VLA, that combine vision and language understanding with the ability to plan and take actions, particularly for robots.
What's the problem?
Existing AI models that generate responses one token at a time, like many large language models, struggle with complex tasks such as planning ahead or controlling robots in real time, precisely because of this sequential, left-to-right generation. They can be slow and inefficient in dynamic situations, or when multiple possibilities need to be considered at once.
What's the solution?
The researchers built their models on a different type of AI architecture, 'diffusion-based' language models, which aren't limited to generating text in a strict left-to-right order. Dream-VL focuses on understanding images and language, while Dream-VLA adds the ability to control actions. They trained Dream-VL on large amounts of publicly available data, then continued pre-training Dream-VLA on open robotic datasets to improve its performance on robotic tasks. The key is that the diffusion approach allows multiple tokens, such as chunks of robot actions, to be generated in parallel, enabling more flexible and faster planning and action generation.
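To make the contrast with left-to-right generation concrete, here is a minimal toy sketch of the kind of parallel, any-order decoding a diffusion language model performs: the sequence starts fully masked, and at each step the most confident masked positions are filled in simultaneously. Everything here is hypothetical illustration (the `toy_scores` stand-in, the vocabulary, the step schedule); it is not the actual Dream-VL/Dream-VLA implementation.

```python
import random

MASK = "<mask>"

def toy_scores(seq):
    # Hypothetical stand-in for a model's per-position confidence;
    # a real dLLM would score masked slots with a neural network.
    return {i: random.random() for i, tok in enumerate(seq) if tok == MASK}

def diffusion_decode(length, steps, vocab):
    """Toy parallel decoder: start fully masked, then at each step
    unmask the most confident positions simultaneously, in any order,
    rather than strictly left-to-right as in autoregressive decoding."""
    seq = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in seq:
        scores = toy_scores(seq)
        # Pick the top-k masked positions and fill them in parallel.
        chosen = sorted(scores, key=scores.get, reverse=True)[:per_step]
        for i in chosen:
            seq[i] = random.choice(vocab)  # model's prediction in a real dVLM
    return seq

random.seed(0)
out = diffusion_decode(length=8, steps=4, vocab=["move", "grasp", "lift", "place"])
print(out)
```

Because several positions are committed per step, the whole sequence is produced in a handful of refinement passes instead of one pass per token, which is the property the paper exploits for action chunking and faster robotic control.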
Why it matters?
These new models perform as well as, or even better than, existing AI systems on various tests, especially when it comes to visual planning and robotic control. This is important because it could lead to robots that are more adaptable, efficient, and capable of handling complex real-world scenarios. The researchers are also sharing their models with the AI community to encourage further development in this area.
Abstract
While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as π_0 and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.