SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
2025-10-14
Summary
This paper focuses on improving how we 'teach' diffusion large language models (dLLMs), a newer type of text-generating AI, to follow instructions and perform tasks well.
What's the problem?
Training these diffusion models with reinforcement learning, a method where the AI learns through trial and error from rewards, is difficult. Unlike older autoregressive models, it's hard to compute how likely a diffusion model is to produce a given answer, and that likelihood is crucial for giving it feedback. Previous attempts worked around this with one-sided approximations that weren't very accurate and could steer the AI toward learning the wrong things.
What's the solution?
The researchers developed a new technique called the Sandwiched Policy Gradient (SPG). It uses both an upper and a lower bound on the likelihood of the model's answers, giving a more reliable learning signal than previous one-sided methods. It's like giving the AI a precise range of feedback instead of a single rough guess.
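To make the idea concrete, here is a minimal, hypothetical sketch of a "sandwiched" surrogate loss. The function name and the rule for choosing between the bounds are illustrative assumptions, not the paper's exact algorithm: the intuition is that the lower bound (e.g., an ELBO) is the conservative choice when an answer earned a positive advantage, while the upper bound is the conservative choice when the advantage is negative.

```python
def sandwiched_pg_loss(lower_bounds, upper_bounds, advantages):
    """Illustrative sandwiched policy-gradient surrogate (not the paper's code).

    lower_bounds: per-sample lower bounds on log-likelihood (e.g., ELBO)
    upper_bounds: per-sample upper bounds on log-likelihood
    advantages:   per-sample reward-derived advantages
    """
    # Pick the conservative bound for each sample: when the advantage is
    # positive we want to push likelihood up, so maximizing the lower
    # bound is safe; when it is negative we push likelihood down, so the
    # upper bound is the safe choice.
    terms = [
        adv * (lo if adv >= 0 else hi)
        for lo, hi, adv in zip(lower_bounds, upper_bounds, advantages)
    ]
    # Negate the mean so that minimizing this loss maximizes the
    # advantage-weighted (sandwiched) log-likelihood surrogate.
    return -sum(terms) / len(terms)
```

In a real training loop the bounds would be differentiable estimates produced by the diffusion model, and gradients would flow through them; this sketch only shows how the two bounds are combined into one learning signal.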
Why does it matter?
This technique significantly improves the performance of diffusion models on challenging tasks such as math problems and logic puzzles, with gains of up to 27% over prior RL methods. That makes it a meaningful step toward more useful and accurate dLLMs, and potentially better AI assistants and problem solvers.
Abstract
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.