
Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li

2025-02-17


Summary

This paper talks about LLaDA, a new type of AI language model that uses a method called diffusion instead of the usual autoregressive approach. It's designed to challenge the idea that autoregressive models are the best way to create large language models.

What's the problem?

Most big AI language models use a method called autoregression, which generates text one word at a time, always left to right. Some researchers think this might limit how well these models can understand and generate language, especially on tasks that require reasoning about text in reverse order (a known weakness sometimes called the "reversal curse").

What's the solution?

The researchers created LLaDA, which uses a diffusion approach instead of autoregression. This new model learns by first hiding parts of the text and then figuring out how to fill in the blanks. They trained LLaDA from scratch and tested it on many different language tasks. Surprisingly, LLaDA performed as well as or better than traditional models, even on tasks that usually favor autoregressive models.
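The "hide parts of the text and fill in the blanks" idea can be sketched in a few lines of plain Python. This is an illustrative toy, not LLaDA's actual training code: the function names, the `[MASK]` token, and the uniform sampling of the masking ratio `t` are assumptions chosen to mirror the description above (a forward process masks tokens at a random rate, and the model's job is to predict what was hidden).

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Forward process: independently hide each token with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def make_training_example(tokens, rng):
    """Build one fill-in-the-blanks training example.

    Samples a masking ratio t uniformly from (0, 1], applies the forward
    masking process, and records which positions the model must recover.
    """
    t = rng.uniform(1e-3, 1.0)  # avoid t = 0, where nothing is masked
    masked = forward_mask(tokens, t, rng)
    targets = {i: tok for i, (tok, m) in enumerate(zip(tokens, masked))
               if m == MASK}
    return masked, targets, t

rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, targets, t = make_training_example(tokens, rng)
```

At generation time the process runs in reverse: start from a fully masked sequence and iteratively predict tokens to unmask, rather than emitting them strictly left to right as an autoregressive model would.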

Why it matters?

This matters because it shows there might be better ways to build AI language models than what we're currently using. LLaDA's success, especially on tasks like completing poems in reverse order, suggests that diffusion models could lead to AI that understands language more flexibly and deeply. This could result in more capable and versatile AI assistants, better language translation tools, and new ways for computers to process and generate human language.

Abstract

Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.
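The likelihood bound mentioned in the abstract can be written out concretely. The formula below is a reconstruction from the paper's description (notation is illustrative): the model $p_\theta$ is trained to predict the clean tokens $x_0$ at the positions masked in $x_t$, with the per-position loss reweighted by the masking ratio $t$:

```latex
\mathcal{L}(\theta) \;=\;
-\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[
  \frac{1}{t} \sum_{i=1}^{L}
  \mathbf{1}\!\left[x_t^{\,i} = \mathrm{M}\right]
  \log p_\theta\!\left(x_0^{\,i} \,\middle|\, x_t\right)
\right]
```

Here $t \sim \mathcal{U}(0, 1]$ is the masking ratio, $x_t$ is $x_0$ with each token independently replaced by the mask token $\mathrm{M}$ with probability $t$, and the indicator restricts the sum to masked positions; this expectation upper-bounds the negative log-likelihood, which is what makes the training objective a principled generative one.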