
Deep Self-Evolving Reasoning

Zihan Liu, Shun Zheng, Xumeng Wen, Yang Wang, Jiang Bian, Mao Yang

2025-10-21


Summary

This paper introduces a new method called Deep Self-Evolving Reasoning (DSER) to improve the problem-solving abilities of smaller, publicly available large language models, specifically focusing on complex math problems.

What's the problem?

Large language models are getting better at reasoning, but the most effective techniques for solving really hard problems currently only work well with very large, privately owned models. Smaller, open-source models struggle because they aren't very good at checking their own work and correcting mistakes. This limits their ability to tackle challenging tasks like advanced math competition problems.

What's the solution?

The researchers realized that even a model with weak self-correction can still make progress: as long as each refinement attempt is slightly more likely to improve the answer than to degrade it, repeated attempts will gradually drift toward the correct solution. They treated the reasoning process as a Markov chain, a series of steps where each step either moves the solution closer to the right answer or farther from it. By running many of these self-evolving reasoning processes in parallel and taking a majority vote over their answers, they amplified the small per-step improvements, allowing the model to converge on correct solutions. They applied this approach to DeepSeek-R1-0528-Qwen3-8B, a compact open-weight model.
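A minimal simulation can illustrate this dynamic. The sketch below models each self-evolving run as a two-state Markov chain (the current answer is either correct or not); the transition probabilities `p_improve` and `p_degrade` are illustrative values chosen for this sketch, not figures from the paper, and the majority vote over parallel chains mirrors how DSER aggregates runs.

```python
import random

def run_chain(steps, p_improve=0.35, p_degrade=0.15, seed=None):
    """One self-evolving reasoning run as a two-state Markov chain.

    p_improve: chance a wrong answer is fixed on a refinement step.
    p_degrade: chance a correct answer is broken on a refinement step.
    p_improve > p_degrade is the condition under which the chain
    drifts toward correctness (values here are illustrative only).
    """
    rng = random.Random(seed)
    correct = False  # assume the first attempt at a hard problem is wrong
    for _ in range(steps):
        if correct:
            correct = rng.random() >= p_degrade  # may degrade
        else:
            correct = rng.random() < p_improve   # may improve
    return correct

def dser_majority(n_chains=64, steps=200, seed=0):
    """Run many chains in parallel and majority-vote their final answers."""
    rng = random.Random(seed)
    finals = [run_chain(steps, seed=rng.randrange(10**9))
              for _ in range(n_chains)]
    return sum(finals) > n_chains / 2

print(dser_majority())
```

With these illustrative probabilities, each individual chain ends up correct only about 70% of the time, yet the majority vote over 64 independent chains is almost always right: the aggregation step is what turns a weak per-step tendency into a reliable answer.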

Why it matters?

This work shows that you can significantly boost the performance of smaller, accessible language models without needing to make them drastically larger or rely on proprietary techniques. It allows a relatively small model to perform better than a much larger one in some cases. More importantly, it helps us understand *why* these smaller models struggle with reasoning, pointing the way towards future research to build better, more self-sufficient AI systems.

Abstract

Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
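The convergence condition stated in the abstract can be made concrete with a two-state Markov chain; the symbols below are illustrative notation introduced here, not taken from the paper. Let $p$ be the probability that a refinement step fixes a wrong answer and $q$ the probability that it breaks a correct one. The stationary probability of being correct is

```latex
\pi_{\text{correct}} = \frac{p}{p + q}
```

so $\pi_{\text{correct}} > \tfrac{1}{2}$ exactly when $p > q$: as long as improvement is even marginally more likely than degradation, long-horizon iteration biases each run toward correctness, and majority voting over parallel runs amplifies that bias.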