EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models
Atula Tejaswi, Litu Rout, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi
2026-02-05
Summary
This paper studies how to steer diffusion language models at generation time so that the text they produce scores well under a separate 'reward' model. Think of it as nudging a writer, while they draft, toward stories that a judge would like.
What's the problem?
Reward guidance relies on gradients: at each denoising step, the gradient of the reward model indicates how the language model's predictions should change. This is tricky for language models because their natural outputs are discrete tokens, and you cannot differentiate through a discrete choice. Previous attempts either replace the discrete tokens with a continuous relaxation (which degrades the reward model's feedback, because it was never trained on such inputs) or use the straight-through estimator, which evaluates the gradient at the discrete tokens but uses it to update the continuous logits, leading to incorrect adjustments.
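To make the two kinds of reward-model input concrete, here is a minimal sketch in plain Python. The 3-token vocabulary, the 2-dimensional embedding table, and all function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy embedding table: one row per token in a 3-token vocabulary.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

def soft_embedding(probs, table):
    """Continuous relaxation: probability-weighted average of the embedding rows."""
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]

def hard_embedding(probs, table):
    """Discrete choice: the embedding row of the most likely token."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return list(table[best])

probs = softmax([1.0, 0.8, 0.2])
soft = soft_embedding(probs, EMBEDDINGS)
hard = hard_embedding(probs, EMBEDDINGS)

# The soft input is a blend that matches no real token embedding;
# the hard input is exactly one row of the table.
print("soft:", soft, "is a real token embedding:", soft in EMBEDDINGS)
print("hard:", hard, "is a real token embedding:", hard in EMBEDDINGS)
```

The soft vector depends smoothly on the logits, so it can be differentiated, but it is an input the reward model never saw in training. The hard vector is a familiar input, but picking it involves an argmax, which has no useful gradient with respect to the logits.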
What's the solution?
The researchers developed a new method called EntRGi, which stands for Entropy aware Reward Guidance. It dynamically regulates the gradients coming from the reward model according to how confident the language model is in its own predictions, and it uses that confidence to modulate the continuous relaxation fed to the reward model. If the model is very sure about a word, EntRGi lets the reward model have more influence. This provides better guidance while still giving the reward model reliable inputs to work with.
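The summary does not give EntRGi's formula, so the following is only a minimal sketch of one way a confidence signal could modulate the continuous relaxation. It assumes that confidence is measured by the normalized entropy of the token distribution, and that higher entropy shifts the reward model's input away from the soft average and toward the real embedding of the most likely token. Both assumptions, the toy embedding table, and all names are hypothetical rather than the authors' implementation.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy divided by log(vocabulary size), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def entropy_aware_input(probs, table):
    """Blend the soft and hard embeddings using an entropy-based weight.

    ASSUMPTION (not from the paper): weight w = normalized entropy, and the
    blend is (1 - w) * soft + w * hard. Low entropy keeps the soft input,
    high entropy moves toward the real embedding of the top token.
    """
    dim = len(table[0])
    soft = [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]
    best = max(range(len(probs)), key=lambda i: probs[i])
    hard = table[best]
    w = normalized_entropy(probs)
    return [(1.0 - w) * s + w * h for s, h in zip(soft, hard)]

# Hypothetical toy embedding table: one row per token in a 3-token vocabulary.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

confident = [0.98, 0.01, 0.01]  # low entropy: the model is sure of token 0
uncertain = [0.34, 0.33, 0.33]  # high entropy: the model is close to guessing

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    w = normalized_entropy(probs)
    print(name, "weight:", round(w, 3), "input:", entropy_aware_input(probs, EMBEDDINGS))
```

Under these assumptions, a confident position keeps an input that is already close to a real token embedding, while an uncertain position is pulled back toward a real embedding instead of an unfamiliar blend. A gradient-based implementation would also have to decide how gradients flow through each term, which this sketch does not model.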
Why does it matter?
This work makes reward guidance more effective for diffusion language models. That gives us better control over the text these models generate, which makes them more useful for tasks where we want specific qualities in the output, like helpfulness, creativity, or factual accuracy. In experiments on a 7B-parameter diffusion language model, using 3 reward models and 3 multi-skill benchmarks, EntRGi consistently improves over state-of-the-art methods.
Abstract
Reward guidance has been applied to great success in the test-time adaptation of continuous diffusion models; it updates each denoising step using the gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations, or employ techniques like the straight-through estimator. In this work, we show the downsides of both these methods. The former degrades gradient feedback because the reward model has never been trained with continuous inputs. The latter involves incorrect optimization because the gradient evaluated at discrete tokens is used to update continuous logits. Our key innovation is to go beyond this tradeoff by introducing a novel mechanism called EntRGi: Entropy aware Reward Guidance that dynamically regulates the gradients from the reward model. By modulating the continuous relaxation using the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.