Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
2025-10-07
Summary
This paper investigates how to get the most out of diffusion-based large language models, which are powerful AI systems for generating text. It finds that these models have hidden strengths that aren't being fully used when they're put to work.
What's the problem?
Diffusion-based language models are really good at learning complex patterns in text, but when you actually *use* them to generate answers or complete tasks, their performance isn't always as good as it could be. The issue is that the order in which these models fill in missing pieces of text can significantly affect the results, yet current methods usually commit to a single fixed generation order, missing out on potential benefits.
What's the solution?
The researchers discovered that these models secretly behave like a team of specialists, each good at generating text in a slightly different way depending on the order and block size in which masked tokens are filled in. They developed a new technique called HEX, which doesn't require any extra training. Instead, HEX runs the model multiple times, each time with a different generation order, and then combines the results using a simple majority vote. This way, it avoids the weaknesses of any single schedule and leverages the strengths of all the 'specialists' hidden inside the model.
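The ensemble idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` stands in for a hypothetical dLLM decoding call that takes a prompt and a block size (the schedule) and returns an answer string, and the block sizes shown are made-up examples.

```python
from collections import Counter

def majority_vote(answers):
    # Most common answer wins; Counter breaks ties by first-seen order.
    return Counter(answers).most_common(1)[0][0]

def hex_inference(generate, prompt, block_sizes=(4, 8, 16, 32)):
    """Run the same model once per block schedule, then majority-vote.

    `generate` is a hypothetical callable (prompt, block_size) -> answer;
    a real dLLM would unmask the sequence block by block at that size.
    """
    answers = [generate(prompt, b) for b in block_sizes]
    return majority_vote(answers)

# Toy stand-in for a dLLM: different schedules can give different answers,
# and the vote recovers the answer most schedules agree on.
def toy_generate(prompt, block_size):
    return "42" if block_size >= 8 else "41"

print(hex_inference(toy_generate, "What is 6*7?"))  # prints "42"
```

The key design point is that no single schedule is trusted: each run is treated as one expert's opinion, and agreement across heterogeneous schedules filters out failure modes tied to any one generation order.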
Why it matters?
This work is important because it dramatically improves the performance of these powerful language models without needing to retrain them. The authors demonstrate significant gains on challenging reasoning tasks like math problems and scientific questions, even beating other advanced methods. It suggests a new way to think about using these models, focusing on *how* they generate text, not just *that* they generate text, and opens the door for further improvements in AI performance.
Abstract
Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By majority voting over diverse block-sized generation paths, HEX robustly avoids failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX even yields significant gains on the MATH benchmark from 16.40% to 40.00%, scientific reasoning on ARC-C from 54.18% to 87.80%, and TruthfulQA from 28.36% to 57.46%. Our results establish a new paradigm for test-time scaling in diffusion-based LLMs (dLLMs), revealing that the sequence in which masking is performed plays a critical role in determining performance during inference.