
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang

2025-12-18


Summary

This paper explores a new way to build models that understand both images and text: instead of training diffusion language models from scratch, it converts existing, powerful autoregressive text models into diffusion vision language models, achieving better results with far less training data and faster generation.

What's the problem?

Diffusion vision language models, which pair image understanding with diffusion-based text generation, still lag behind mainstream autoregressive models. This is likely because the diffusion language models they are built on aren't powerful enough. The researchers asked whether these models could instead be built from already strong autoregressive text models, rather than starting from scratch with weaker diffusion models.

What's the solution?

The researchers created a new family of models called DiffusionVL that can be built on top of existing, powerful autoregressive text models. Through simple fine-tuning, they 'translate' these text models into the diffusion paradigm while teaching them to understand images. They also added a block-decoding scheme that generates long responses piece by piece and reuses information the model has already processed (its key-value cache), making generation faster.
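
To make the 'translation' step more concrete, below is a minimal, hypothetical PyTorch sketch of masked-diffusion fine-tuning, the general technique for adapting an autoregressive model to the diffusion paradigm. It is not the paper's actual code: `model`, `MASK_ID`, and `response_mask` are illustrative assumptions, and the model is assumed to return per-token logits under bidirectional attention over the response.

```python
import torch
import torch.nn.functional as F

def diffusion_finetune_step(model, optimizer, input_ids, response_mask, MASK_ID):
    """One masked-diffusion fine-tuning step (illustrative sketch only).

    input_ids:     (B, T) prompt + response token ids
    response_mask: (B, T) bool, True where a token belongs to the response
    """
    B, T = input_ids.shape
    device = input_ids.device

    # Sample a per-sequence masking ratio, playing the role of a diffusion timestep.
    ratio = torch.rand(B, 1, device=device)
    noise = torch.rand(B, T, device=device)
    to_mask = response_mask & (noise < ratio)        # corrupt only response tokens

    noisy_ids = input_ids.masked_fill(to_mask, MASK_ID)

    # Assumed: the model runs with bidirectional attention over the response,
    # so masked positions can see context on both sides (the paradigm shift).
    logits = model(noisy_ids)                        # (B, T, vocab)

    # Train the model to recover the clean tokens at the masked positions.
    loss = F.cross_entropy(logits[to_mask], input_ids[to_mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```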

Why it matters?

This work is important because it shows that excellent image and text understanding can be achieved by building on existing technology, rather than creating everything from scratch. It also demonstrates a more efficient way to build these models, requiring significantly less data and running faster, which makes them more practical for real-world applications.

Abstract

In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual instruction tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, including a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
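
The block-decoding idea mentioned in the abstract can be sketched as follows. This is an illustrative, hypothetical outline rather than the released implementation: `model.prefill`, `model.forward_block`, and `model.append_cache` are assumed interfaces that cache the prompt once, denoise one block (bidirectional attention inside the block, attention to the cached prefix), and append the finished block's key/value states.

```python
import torch

@torch.no_grad()
def block_decode(model, prompt_ids, MASK_ID, block_size=32, steps=8,
                 max_blocks=16, eos_id=None):
    """Generate a response block by block, reusing the KV cache of finished blocks."""
    past_kv = model.prefill(prompt_ids)              # assumed API: caches the prompt once
    blocks = []
    for _ in range(max_blocks):
        # Start a fresh block of fully masked tokens.
        block = torch.full((1, block_size), MASK_ID, device=prompt_ids.device)
        for s in range(steps):
            # Assumed API: returns per-token logits and the block's key/value states.
            logits, _ = model.forward_block(block, past_kv)
            conf, pred = logits.softmax(-1).max(-1)  # per-position confidence and guess

            # Unmask the most confident of the still-masked positions at this step.
            still_masked = block.eq(MASK_ID)
            remaining = int(still_masked.sum())
            k = max(1, remaining // (steps - s))
            conf = conf.masked_fill(~still_masked, float("-inf"))
            idx = conf.topk(k, dim=-1).indices
            block.scatter_(1, idx, pred.gather(1, idx))

        # Finalize the block and cache its keys/values so later blocks reuse them.
        _, block_kv = model.forward_block(block, past_kv)
        past_kv = model.append_cache(past_kv, block_kv)  # assumed API
        blocks.append(block)
        if eos_id is not None and (block == eos_id).any():
            break
    return torch.cat(blocks, dim=1)
```

The design point this sketch tries to capture is that once a block is finalized its keys and values are never recomputed, which is what allows arbitrary-length generation while keeping the cache efficiency that the abstract credits for the inference speedup.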