
Discrete Diffusion in Large Language and Multimodal Models: A Survey

Runpeng Yu, Qi Li, Xinchao Wang

2025-06-17


Summary

This paper surveys Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs), new types of AI models designed to generate text and multimodal content faster than traditional methods. Instead of generating words one by one like autoregressive models, these models use a process called discrete diffusion, which gradually refines noisy or masked data back into clear text, or combined text and images, updating many positions at once. They also use a full attention mechanism, so every token can attend to all parts of the input simultaneously during generation.
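To see why parallel refinement can be faster, here is a back-of-the-envelope comparison of forward passes needed to produce a sequence. The numbers (sequence length, tokens unmasked per step) are illustrative assumptions, not figures from the survey:

```python
# Toy cost comparison: autoregressive vs. discrete diffusion decoding.
# These numbers are hypothetical, chosen only to illustrate the scaling.

seq_len = 16          # number of tokens to generate
tokens_per_step = 4   # assumed parallel unmasking rate of a diffusion model

# Autoregressive: one forward pass per generated token.
ar_passes = seq_len

# Discrete diffusion: each pass commits several tokens in parallel.
diffusion_passes = seq_len // tokens_per_step

print(ar_passes)         # 16 passes
print(diffusion_passes)  # 4 passes
```

In practice the speedup depends on how many denoising steps a given model needs for acceptable quality, but the basic trade is the same: fewer, wider passes instead of many narrow ones.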

What's the problem?

The problem is that traditional language models generate outputs step-by-step, which can be slow and inefficient, especially when dealing with very long texts or complex multimodal tasks involving images and text. This sequential process limits the speed and scalability of AI systems when they need to produce high-quality outputs quickly.

What's the solution?

The surveyed models use discrete diffusion to denoise corrupted (masked or noised) versions of text and multimodal data in parallel rather than sequentially. By modeling the corruption and denoising of tokens in discrete steps, and combining this with full attention across all tokens, these models can generate content faster while maintaining or even improving accuracy compared to traditional autoregressive models.
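The parallel denoising idea can be sketched as a loop that starts from a fully masked sequence and, at each step, commits the model's most confident predictions for several positions at once. This is a toy sketch: `toy_denoiser` stands in for a trained dLLM, and the vocabulary, confidence scores, and unmasking schedule are all invented for illustration:

```python
import random

random.seed(0)
MASK = "[MASK]"
vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary

def toy_denoiser(seq):
    """Stand-in for a trained dLLM: proposes a (token, confidence) pair
    for every masked position, conditioning on the full sequence."""
    return {i: (random.choice(vocab), random.random())
            for i, tok in enumerate(seq) if tok == MASK}

def generate(length=8, steps=4):
    seq = [MASK] * length            # fully corrupted starting state
    per_step = length // steps       # simple fixed unmasking schedule
    for _ in range(steps):
        guesses = toy_denoiser(seq)
        # Commit the highest-confidence predictions in parallel.
        best = sorted(guesses.items(), key=lambda kv: -kv[1][1])[:per_step]
        for i, (tok, _conf) in best:
            seq[i] = tok
    return seq

out = generate()
print(out)  # every position is filled after `steps` parallel denoising rounds
```

A real dLLM replaces `toy_denoiser` with a transformer using full attention, and typically re-predicts all masked positions at every step, so later rounds can use earlier commitments as context.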

Why it matters?

This matters because faster and more efficient generation methods allow AI systems to handle larger, more complex problems like long documents, conversations, or multimodal content involving images and text. Discrete diffusion models help make AI tools quicker and more powerful, supporting applications in writing, multimedia creation, and interactive AI where speed and quality are both crucial.

Abstract

Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs) enable parallel generation and faster inference compared to autoregressive models through denoising-based strategies and full attention mechanisms.