Scaling Behavior of Discrete Diffusion Language Models

Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, Antonio Orvieto

2025-12-15

Summary

This research investigates how well discrete diffusion language models, a newer type of language model, perform as they get bigger and are trained with more data, comparing them to the more common autoregressive language models.

What's the problem?

Large language models require huge amounts of computing power and data to train. Understanding how performance improves as you increase these resources – known as scaling laws – is crucial for deciding which model types are most efficient. While discrete diffusion models are an alternative to the standard approach, it wasn't clear if they could scale as effectively, with some evidence suggesting they needed even *more* resources to achieve similar results.
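To make the idea of a scaling law concrete, here is a minimal, hypothetical sketch (not the paper's actual analysis, and all numbers are synthetic) of how a power-law relationship between compute and loss can be estimated: with loss = a * C^(-alpha), taking logarithms gives a straight line whose slope is -alpha.

```python
import numpy as np

# Synthetic FLOP budgets and losses following a made-up power law
# loss = a * C^(-alpha); none of these constants come from the paper.
compute = np.logspace(18, 22, 10)            # compute budgets in FLOPs
true_a, true_alpha = 2.0e3, 0.15             # illustrative constants
loss = true_a * compute ** (-true_alpha)     # noiseless synthetic losses

# In log-log space the power law is linear:
# log(loss) = log(a) - alpha * log(C), so alpha is minus the slope.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha = -slope
print(f"fitted scaling exponent alpha = {alpha:.3f}")  # recovers 0.15
```

Comparing such fitted exponents across model families is what lets researchers say one architecture scales better than another.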

What's the solution?

The researchers systematically tested how different types of 'noise' used in the diffusion process affected the model's scaling behavior. They experimented with varying the amount of data, the number of parameters in the model, and the batch size and learning rate during training. They ultimately trained a very large uniform diffusion model with 10 billion parameters, using roughly 10^22 FLOPs of compute, to confirm their findings.
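The two noise types can be pictured as one-step token-corruption kernels, and interpolating between them amounts to mixing where a corrupted token ends up. The sketch below is illustrative only; the paper's exact parameterization of the interpolation may differ, and the vocabulary size and corruption probability here are made up.

```python
import numpy as np

V = 5           # tiny vocabulary; last index reserved for a [MASK] token
beta = 0.3      # per-step probability that a token gets corrupted (made up)

# Where a corrupted token goes distinguishes the two noise types:
mask_target = np.zeros(V)
mask_target[-1] = 1.0               # masked diffusion: always becomes [MASK]
unif_target = np.full(V, 1.0 / V)   # uniform diffusion: any token, uniformly

def transition_matrix(lam):
    """Row-stochastic one-step kernel; lam=0 is masked, lam=1 is uniform."""
    target = (1 - lam) * mask_target + lam * unif_target
    return (1 - beta) * np.eye(V) + beta * np.tile(target, (V, 1))

# Every interpolated kernel is a valid transition matrix (rows sum to 1),
# and at lam=0 the [MASK] state is absorbing: once masked, always masked.
for lam in (0.0, 0.5, 1.0):
    Q = transition_matrix(lam)
    assert np.allclose(Q.sum(axis=1), 1.0)
```

Sweeping the interpolation parameter while retuning batch size and learning rate at each point is, at a high level, how the study maps out how scaling behavior changes between the masked and uniform extremes.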

Why it matters?

The study found that the scaling behavior of these diffusion models is quite different from traditional models and heavily depends on the type of noise used. Specifically, uniform diffusion seems to be a promising approach when data is limited, potentially offering a more efficient way to build powerful language models in situations where getting lots of training data is difficult or expensive.

Abstract

Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making it a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for 10^22 FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.