Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran, Chen Jin, Philip Alexander Teare, Bin Liang, Yulan He, Lin Gui
2025-10-14
Summary
This paper introduces a new method called Latent Refinement Decoding (LRD) to improve how quickly and accurately computers generate text, like writing code or solving math problems.
What's the problem?
Currently, the standard way to generate text is to produce it one word at a time, which is slow. Newer, faster methods inspired by image diffusion models generate many words in parallel, but they often discard useful information along the way or commit to choices too early, without considering the whole sentence, leading to less accurate results.
What's the solution?
LRD works in two stages. First, instead of committing to a single word at each position, it keeps the full range of possible words in mind, blending them together so the model can form a more globally consistent picture of what the text should be. Then it progressively finalizes the words it is confident about, while keeping uncertain positions open for further refinement. This loop continues until a mathematical measure of change (KL divergence) indicates the predictions have settled, at which point decoding stops. Essentially, it's a smarter way to build text in parallel that avoids both premature decisions and information loss.
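The two stages above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_model` is a stand-in for the real diffusion language model, and the mixing weight `alpha`, confidence threshold `conf`, and tolerance `tol` are illustrative placeholders. It only shows the control flow: masked positions carry a mixture of predicted-token and mask embeddings, confident positions get committed, and a KL-divergence check decides when to stop.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ, DIM = 8, 5, 4
token_emb = rng.normal(size=(VOCAB, DIM))   # toy token embedding table
mask_emb = rng.normal(size=DIM)             # embedding of the [MASK] token
W = rng.normal(size=(DIM, VOCAB))           # toy "model" weights

def toy_model(inputs):
    """Stand-in for the diffusion LM: per-position token distributions."""
    logits = inputs @ W
    p = np.exp(logits - logits.max(-1, keepdims=True))
    return p / p.sum(-1, keepdims=True)

def lrd_decode(steps=50, conf=0.9, tol=1e-4, alpha=0.5):
    finalized = np.full(SEQ, -1)              # -1 = not yet committed
    inputs = np.tile(mask_emb, (SEQ, 1))      # start fully masked
    prev = np.full((SEQ, VOCAB), 1.0 / VOCAB)
    for _ in range(steps):
        probs = toy_model(inputs)
        # Stage 1 (latent refinement): a masked position carries a mixture
        # of its expected token embedding and the mask embedding, so the
        # predictive distribution is not thrown away between steps.
        expected = probs @ token_emb
        for i in range(SEQ):
            if finalized[i] >= 0:
                inputs[i] = token_emb[finalized[i]]  # committed tokens stay fixed
            else:
                inputs[i] = alpha * expected[i] + (1 - alpha) * mask_emb
        # Stage 2 (predictive feedback): commit only confident positions;
        # uncertain ones stay in the loop for further refinement.
        for i in range(SEQ):
            if finalized[i] < 0 and probs[i].max() >= conf:
                finalized[i] = probs[i].argmax()
        # KL between successive belief states as a convergence signal.
        kl = np.sum(prev * (np.log(prev + 1e-12) - np.log(probs + 1e-12)))
        prev = probs
        if kl < tol and (finalized >= 0).all():
            break
    # Fall back to argmax for any position never committed.
    undecided = finalized < 0
    finalized[undecided] = probs[undecided].argmax(-1)
    return finalized
```

Running `lrd_decode()` returns one token id per position; the real method differs in model, schedule, and thresholds, but follows this commit-or-refine pattern.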
Why it matters?
This research is important because it offers a significantly faster way to generate text without sacrificing accuracy. The experiments show improvements in tasks like coding and math problem-solving, and the speed increases are substantial, potentially making these technologies much more practical for real-world applications.
Abstract
Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LlaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.
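The abstract's claim that KL-divergence dynamics give a reliable stopping criterion can be sketched as a small helper. The function names and the tolerance value here are illustrative, not from the paper: the idea is simply to halt decoding once successive belief states stop moving.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two sets of per-position token distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def first_converged_step(belief_trace, tol=1e-3):
    """Early-stop rule: return the first step at which the belief state
    barely changed from the previous one, or None if it never settled."""
    for t in range(1, len(belief_trace)):
        if kl(belief_trace[t - 1], belief_trace[t]) < tol:
            return t
    return None
```

For example, a trace whose distributions move from (0.5, 0.5) to (0.8, 0.2) and then only nudge to (0.81, 0.19) would converge at the last step, since the final KL gap falls below the tolerance.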