DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution
Shengda Fan, Xuyan Ye, Yankai Lin
2026-01-21
Summary
This paper introduces DARC, a new method for improving large language models by having them essentially 'teach themselves' through a process called self-play, without relying on human-labeled training data.
What's the problem?
When AI systems learn by playing against themselves, training can become unstable and produce poor results. This happens for two main reasons. First, the part of the AI that asks 'questions' (the Questioner) is rewarded based on how the other part (the Solver) performs, and since the Solver keeps changing as it learns, the Questioner is chasing a moving target. Second, the Solver learns from its own answers, which can reinforce mistakes when those initial answers are wrong to begin with.
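To make these two failure modes concrete, here is a minimal, self-contained Python sketch of one step of a naive self-play loop. This is not the paper's code: the Questioner and Solver classes, their methods, and the majority-vote reward are toy stand-ins for the corresponding LLM calls.

```python
from collections import Counter
import random

class Questioner:
    def generate_question(self) -> str:
        return "toy question"             # stand-in for an LLM generation
    def update(self, reward: float) -> None:
        pass                              # stand-in for a policy-gradient step

class Solver:
    def answer(self, question: str) -> str:
        return random.choice(["A", "B"])  # stand-in for a sampled LLM answer
    def update(self, question: str, label: str) -> None:
        pass                              # stand-in for training on the label

def naive_self_play_step(questioner: Questioner, solver: Solver) -> None:
    question = questioner.generate_question()
    answers = [solver.answer(question) for _ in range(8)]

    # Failure mode 2: the pseudo-label is the Solver's own majority-vote
    # answer, so early mistakes become "ground truth" and get reinforced
    # (bootstrapping error).
    pseudo_label, votes = Counter(answers).most_common(1)[0]
    solver.update(question, pseudo_label)

    # Failure mode 1: the Questioner's reward depends on the *current*
    # Solver's success rate, so its optimization target shifts every time
    # the Solver updates (a non-stationary objective).
    solve_rate = votes / len(answers)
    questioner.update(reward=1.0 - solve_rate)

naive_self_play_step(Questioner(), Solver())
```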
What's the solution?
DARC tackles these problems in two decoupled steps. First, it trains the Questioner to generate questions at explicitly specified difficulty levels, grounding them in outside documents rather than in the Solver's current ability. Then, it trains the Solver with a 'teacher-student' approach: a more knowledgeable version of the Solver, one that *does* get to read the relevant document, produces high-quality example answers that supervise the version that doesn't. This helps the Solver learn from reliable answers instead of its own potentially flawed ones, as sketched below.
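Based only on the description above, the following Python sketch shows how the two stages might fit together; every class, method, and prompt here is a hypothetical stand-in for the corresponding LLM call, not the authors' implementation.

```python
class Questioner:
    def train_on(self, prompt: str) -> None:
        pass                              # stand-in for a fine-tuning step
    def generate(self, document: str, level: str) -> str:
        return f"a {level} question about: {document[:20]}"

class Solver:
    def answer(self, question: str, context: str = "") -> str:
        # The teacher call passes the source document as context, which in
        # practice yields more reliable answers than the context-free call.
        return "answer"
    def update(self, question: str, label: str) -> None:
        pass                              # stand-in for supervised training

LEVELS = ("easy", "medium", "hard")

def stage1_train_questioner(questioner: Questioner, corpus: list[str]) -> None:
    # Stage 1: condition question synthesis on an explicit difficulty level
    # and an external document, so the Questioner's objective no longer
    # depends on the Solver's moving ability.
    for document in corpus:
        for level in LEVELS:
            prompt = f"Write a {level} question grounded in:\n{document}"
            questioner.train_on(prompt)

def stage2_asymmetric_distillation(questioner: Questioner, solver: Solver,
                                   corpus: list[str]) -> None:
    # Stage 2: a document-augmented "teacher" pass of the same Solver answers
    # with the source document in context; its answer becomes the pseudo-label
    # that supervises the "student" pass, which sees only the question.
    for document in corpus:
        question = questioner.generate(document, level="medium")
        teacher_label = solver.answer(question, context=document)  # teacher
        solver.update(question, teacher_label)        # student: no document

corpus = ["Example document text used to ground question generation."]
stage1_train_questioner(Questioner(), corpus)
stage2_asymmetric_distillation(Questioner(), Solver(), corpus)
```

Because the teacher and student are the same model differing only in document access, the supervision signal is asymmetric: the student is pushed toward answers it could only produce with the document, rather than toward its own unaided guesses.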
Why it matters?
This research is important because it shows a way to make AI systems learn and improve reliably without large amounts of human-labeled data. DARC boosts performance by an average of 10.9 points across nine reasoning benchmarks and comes close to the results of AI systems trained with extensive human supervision, making it a promising step toward more self-sufficient and capable AI.
Abstract
Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, where a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.