
Guided Self-Evolving LLMs with Minimal Human Supervision

Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu

2025-12-03

Summary

This paper explores how to make AI systems improve themselves through self-play, aiming for more capable models that need only minimal human guidance to stay on track.

What's the problem?

The big challenge is that when AI tries to learn and improve on its own, it often gets stuck or even gets worse over time. This happens because the AI can develop biases, start focusing on narrow areas, or lose the ability to explore new ideas. Essentially, it reinforces its own mistakes and doesn't continue to grow effectively.

What's the solution?

The researchers developed a system called R-Few, which uses a 'Challenger-Solver' approach with a little bit of human guidance. The Challenger creates new learning problems, but it gets help from humans who provide a small number of examples to keep it on track. The Solver then learns from both the human examples and the problems created by the Challenger, gradually tackling harder challenges. This process helps the AI learn in a more stable and controlled way.
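The Challenger-Solver loop described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `ToyChallenger`, `ToySolver`, and `guided_self_play` are hypothetical names, and the placeholder "generation" and "training" logic stands in for actual LLM calls.

```python
import random

class ToyChallenger:
    """Stand-in for the Challenger model (hypothetical interface)."""
    def generate(self, seed_examples):
        # Produce one synthetic question per human-labeled seed
        # (placeholder logic; the real Challenger is an LLM).
        return [f"variant of: {question}" for question, _ in seed_examples]

class ToySolver:
    """Stand-in for the Solver; here it just records its training data."""
    def __init__(self):
        self.seen = []
    def train(self, batch):
        self.seen.extend(batch)

def guided_self_play(challenger, solver, human_examples, iterations=3, k=2):
    """One guided self-play loop in the spirit of R-Few (a sketch)."""
    for _ in range(iterations):
        # In-context grounding: a few human-labeled examples
        # steer the Challenger's synthetic question generation.
        seeds = random.sample(human_examples, k)
        synthetic = challenger.generate(seeds)
        # Mixed training: the Solver sees human and synthetic data together,
        # which is what keeps the loop from drifting.
        solver.train(list(human_examples) + [(q, None) for q in synthetic])
    return solver
```

The key design point the sketch captures is that human data enters in two places: as generation seeds for the Challenger and as a fixed anchor in every Solver training batch.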

Why it matters?

This work is important because it offers a way to build AI that can continuously improve itself with less reliance on humans. The results show that R-Few can significantly boost an AI's performance, even with much less human-labeled data compared to other methods, bringing us closer to the goal of truly intelligent and self-sufficient AI systems.

Abstract

AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
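The "online, difficulty-based curriculum" mentioned in the abstract can be illustrated with a simple filter. One plausible reading, sketched below, is that synthetic questions the Solver nearly always or nearly never answers correctly carry little training signal, so only questions with an intermediate empirical success rate are kept. The function name and the band endpoints are illustrative assumptions, not values from the paper.

```python
def curriculum_filter(questions, success_rates, band=(0.25, 0.75)):
    """Keep synthetic questions of intermediate difficulty (a sketch).

    `success_rates[i]` is the Solver's empirical accuracy on
    `questions[i]`; questions outside the band are too easy or
    too hard to be useful for the current training iteration.
    """
    low, high = band
    return [q for q, rate in zip(questions, success_rates)
            if low <= rate <= high]
```

For example, with success rates of 0.0, 0.3, 0.5, and 1.0 on four questions, only the middle two would survive the default band.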