DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
2026-02-24
Summary
This paper focuses on improving how large language models (LLMs) solve complex problems using a technique called Reinforcement Learning with Verifiers (RLVR). It introduces a new method, DSDR, to help these models explore different reasoning paths more effectively.
What's the problem?
Current RLVR methods struggle with exploration: LLMs tend to collapse onto the same few reasoning strategies and stop investigating alternative solutions. Simply injecting randomness (e.g., standard entropy regularization) doesn't fix this, because it perturbs individual tokens without encouraging fundamentally different solution paths. The result is weak and unstable learning signals in group-based policy optimization, where the model is trained by comparing groups of sampled solutions to the same problem.
What's the solution?
The researchers developed DSDR, which stands for Dual-Scale Diversity Regularization. This method decomposes 'diversity' in reasoning into two scales. Globally, it encourages the model to find many *different* correct ways to solve a problem, so that correct solutions span distinct strategies. Locally, within each correct approach, it adds a controlled amount of token-level randomness to keep the model from collapsing onto a single rigid rendering of that strategy. Importantly, the local regularization is applied only to solution paths that are already correct, and its strength is adjusted based on how distinctive each correct path is (the global-to-local coupling).
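The mechanics described above can be sketched in code. The paper's exact objective is not reproduced in this summary, so the following is a rough illustration under assumptions: path-level distinctiveness is measured by cosine similarity between per-trajectory embeddings, and token-level entropy is approximated by the mean negative log-probability of sampled tokens (a length-invariant proxy). All names and weights here (`dsdr_bonus`, `lam_global`, `lam_local`) are hypothetical, not the authors' implementation:

```python
import numpy as np

def dsdr_bonus(token_logps, embeddings, correct,
               lam_global=0.1, lam_local=0.01):
    """Sketch of a dual-scale diversity bonus (hypothetical).

    token_logps: list of per-token log-prob arrays, one per trajectory
    embeddings:  (N, d) array, one embedding per trajectory
    correct:     (N,) boolean array of verifier outcomes
    """
    idx = np.flatnonzero(correct)          # regularize correct paths only
    if idx.size < 2:
        return 0.0                         # no pairwise diversity to measure

    # Global scale: reward dissimilarity among correct trajectories.
    emb = embeddings[idx]
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                      # cosine similarity matrix
    n = idx.size
    mean_off_diag = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    global_bonus = 1.0 - mean_off_diag     # high when solution modes differ

    # Coupling: distinctiveness of each correct path = 1 - mean similarity
    # to the other correct paths. More distinctive paths get more local
    # entropy regularization.
    distinct = 1.0 - (sim.sum(axis=1) - 1.0) / (n - 1)

    # Local scale: length-invariant entropy proxy (mean, not sum, over
    # tokens), applied only within correct trajectories.
    local_bonus = 0.0
    for k, i in enumerate(idx):
        entropy_proxy = -np.mean(token_logps[i])
        local_bonus += distinct[k] * entropy_proxy
    local_bonus /= n

    return lam_global * global_bonus + lam_local * local_bonus
```

In a training loop, this bonus would be added to the verifier reward before the group-based advantage computation; the key design point is that only verified-correct trajectories contribute, so diversity pressure cannot trade away correctness.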
Why it matters?
This work is important because it shows that encouraging diversity at both a high level (different solution strategies) and a low level (variations within a strategy) significantly improves the accuracy and reliability of LLMs when they're tackling challenging reasoning tasks. It provides a more principled way to guide the learning process and unlock the full potential of these powerful models.
Abstract
Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
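The abstract reports gains in pass@k, the probability that at least one of k sampled solutions is correct. For reference, this metric is commonly computed with the standard unbiased estimator from n samples of which c pass the verifier; the helper below is a generic illustration, not code from the DSDR repository:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), i.e. one minus
    the probability that k samples drawn without replacement from n
    generations (c of them correct) are all incorrect."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 is correct, pass@1 is 0.25, while pass@2 rises to 1 - C(3,2)/C(4,2) = 0.5.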