OpenSIR: Open-Ended Self-Improving Reasoner
Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini
2025-11-04
Summary
This paper introduces a new way to improve large language models' ability to solve complex problems, specifically math problems, without relying on humans to constantly check their work.
What's the problem?
Currently, making language models better at reasoning often involves training them on problems with known answers. This caps how good they can become, because they are always learning from predefined examples and cannot push beyond what humans have already demonstrated. Existing methods that let models learn by playing against themselves either need external help to verify answers or struggle to keep generating new, suitably challenging problems.
What's the solution?
The researchers created a system called OpenSIR in which a single language model alternates between a 'teacher' and a 'student' role. The 'teacher' generates new math problems that are difficult but not impossible and that cover a diverse range of mathematical concepts. The 'student' then tries to solve them. This loop repeats, with both roles improving over time, all without any human intervention. The system starts from a single trivial seed problem and builds up from there.
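The teacher–student loop can be sketched in miniature. The toy below is an illustrative stand-in, not the paper's implementation: `teacher_propose`, `student_solve`, and `teacher_reward` are hypothetical functions, with simple arithmetic problems and a fixed success probability replacing the two LLM roles. The point it shows is the reward shaping: the teacher scores highest when the student's solve rate is intermediate, i.e. the problem is challenging but not impossible.

```python
import random

def teacher_propose(difficulty, rng):
    """Hypothetical stand-in for the LLM teacher: generate an addition
    problem whose operand size grows with difficulty."""
    hi = 10 ** difficulty
    a, b = rng.randrange(hi), rng.randrange(hi)
    return (f"{a} + {b} = ?", a + b)

def student_solve(problem, difficulty, rng):
    """Hypothetical stand-in for the LLM student: succeeds with a
    probability that drops as difficulty rises, mimicking a solver
    whose pass rate falls on harder problems."""
    question, answer = problem
    p_correct = max(0.0, 1.0 - 0.2 * difficulty)
    return answer if rng.random() < p_correct else None

def teacher_reward(solve_rate, target=0.5):
    """Reward problems that challenge appropriately: highest when the
    student's empirical solve rate sits near an intermediate target,
    lowest when the problem is trivial (rate ~1) or hopeless (rate ~0)."""
    return 1.0 - abs(solve_rate - target)

def self_play_round(difficulty, rng, attempts=16):
    """One round: teacher proposes, student attempts several times,
    teacher is rewarded according to the observed solve rate."""
    problem = teacher_propose(difficulty, rng)
    correct = sum(student_solve(problem, difficulty, rng) is not None
                  for _ in range(attempts))
    return teacher_reward(correct / attempts)

rng = random.Random(0)
# Starting from a trivial seed (difficulty 1) and escalating: extremes of
# difficulty earn the teacher less reward than the intermediate band.
rewards = {d: round(self_play_round(d, rng), 2) for d in range(1, 6)}
print(rewards)
```

In the actual framework both roles are played by the same model and both are updated by reinforcement learning, with an additional diversity signal encouraging problems that explore distinct concepts; this sketch only captures the difficulty-calibration incentive.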
Why it matters?
This work is important because it shows a path towards creating language models that can learn and improve on their own, potentially surpassing human-level performance in areas like math. By removing the need for human-annotated data and external verification, it opens up possibilities for more advanced and autonomous AI systems that can discover new knowledge and solve problems we haven't even thought of yet.
Abstract
Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.