
Diffusion Language Models are Super Data Learners

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh

2025-11-06


Summary

This paper investigates how two types of language models, diffusion language models (DLMs) and autoregressive (AR) models, perform when trained on limited amounts of data. It finds that DLMs actually *outperform* AR models in these low-data situations, but the crossover point moves: it arrives later with more or higher-quality data, and earlier with larger models.

What's the problem?

Typically, autoregressive models are considered the standard for language modeling. However, when you don't have a huge dataset to train on, it's not always clear which type of model is better. The researchers wanted to understand if diffusion models, a newer approach, could be a better choice when data is scarce, and *why* that might be the case. They noticed that in some situations, diffusion models were doing surprisingly well, and wanted to figure out what was causing this.

What's the solution?

The researchers ran extensive experiments comparing DLMs and AR models under strictly controlled conditions, varying the amount of unique data, the size of the model, and the training budget. They found that DLMs keep improving from longer training (repeating the same data for more epochs) when unique data is limited. They attribute this to three compounding factors: DLMs model tokens in any order rather than strictly left to right, they spend more compute per token through iterative bidirectional denoising, and their random masking during training acts as built-in data augmentation, which helps them generalize. They showed that a relatively small DLM could achieve impressive results even with a limited dataset, surpassing a larger AR model trained under strictly matched settings.
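To see why random masking amounts to built-in augmentation, here is a toy sketch (not the authors' code; the `MASK` token, mask-ratio range, and helper names are illustrative assumptions). An AR model sees the same fixed left-to-right prediction pairs every epoch, while a masked-diffusion objective draws a fresh random corruption of the sequence each time, so repeated epochs over the same data still produce new training views:

```python
import random

MASK = "<mask>"  # placeholder token; name is an assumption

def ar_examples(tokens):
    # Autoregressive training: one fixed factorization -- predict each
    # token from its left context. Repeating an epoch repeats these
    # exact (context, target) pairs.
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

def diffusion_example(tokens, rng):
    # Masked-diffusion training view: sample a masking ratio and mask
    # positions at random, then predict the hidden tokens from the
    # remaining bidirectional context. Every draw is a different
    # corruption of the same sequence -- Monte Carlo augmentation.
    ratio = rng.uniform(0.1, 0.9)
    masked = [i for i in range(len(tokens)) if rng.random() < ratio]
    corrupted = [MASK if i in masked else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked}
    return corrupted, targets

rng = random.Random(0)
toks = ["def", "add", "(", "a", ",", "b", ")", ":"]

# AR: the set of training pairs is identical on every pass.
print(len(ar_examples(toks)))  # 7 fixed pairs for an 8-token sequence

# Diffusion: 20 passes over the same sequence yield many distinct views.
views = {tuple(diffusion_example(toks, rng)[0]) for _ in range(20)}
print(len(views) > 1)
```

The single-sequence example is only meant to show where the extra variety comes from when unique data is scarce; the paper's experiments use this property at the scale of billions of tokens.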

Why it matters?

This research is important because it shows that diffusion models aren't just a theoretical curiosity: they can be a practical alternative to autoregressive models, especially when you're working with limited data. This is really useful in situations where collecting large datasets is expensive or impossible, such as specialized programming languages or niche areas of knowledge. It also suggests that how we measure model performance during training (for example, watching validation loss) may need to be re-evaluated in this data-constrained regime, since rising validation loss did not imply worse downstream performance.

Abstract

Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.