Bootstrapping Task Spaces for Self-Improvement
Minqi Jiang, Andrei Lupu, Yoram Bachrach
2025-09-08
Summary
This paper introduces a new method called Exploratory Iteration (ExIt) for training AI agents, specifically large language models, to get better at tasks by repeatedly revising their own work, even for more revision steps than they were explicitly trained on.
What's the problem?
Typically, when training an AI to improve its answers over multiple attempts, researchers have to fix a maximum number of revision steps in advance. This limit is often arbitrary and inefficient, since some tasks need more revisions than others, and training directly over long revision sequences is costly.
What's the solution?
ExIt tackles this by choosing which revision steps to focus on during training. Rather than training the AI over entire multi-step revision sequences, it trains only on single revision steps, and it selects which ones by identifying the most 'informative' intermediate results of the AI's own self-improvement attempts. Those partial solution histories are added back to the pool as new task instances, so the task space grows out of the AI's own work. ExIt can also pair this with explicit exploration mechanisms to keep the set of revision paths diverse. The result is an AI that can improve itself on tasks it hasn't seen before and continue improving for more steps than it typically encountered during training; a rough sketch of the training loop appears below.
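To make this concrete, here is a minimal sketch of such a bootstrapping loop. It is an illustration under stated assumptions, not the authors' implementation: `generate_revision`, `evaluate`, and `rl_update` are hypothetical stand-ins for an LLM sampler, a task verifier, and an RL update (for example, a group-based policy-gradient step), and the 'informativeness' test is simplified to reward disagreement within a sampled group.

```python
# Minimal sketch of an ExIt-style bootstrapping step (illustrative only).
from dataclasses import dataclass, field
import random

@dataclass
class TaskInstance:
    problem: str
    history: list = field(default_factory=list)  # earlier solution attempts

def exit_step(task_buffer, generate_revision, evaluate, rl_update, group_size=4):
    """One single-step self-improvement update that also grows the task space."""
    task = random.choice(task_buffer)

    # Sample a group of candidate revisions of the current partial history.
    revisions = [generate_revision(task) for _ in range(group_size)]
    rewards = [evaluate(task.problem, r) for r in revisions]

    # Train the policy only on this single-step iteration.
    rl_update(task, revisions, rewards)

    # Bootstrapping: if the step was informative (rewards disagree across the
    # group), keep an intermediate result as a new self-iteration task instance.
    if max(rewards) > min(rewards):
        chosen = random.choice(revisions)
        task_buffer.append(TaskInstance(task.problem, task.history + [chosen]))
```

In this simplified picture, the buffer could be seeded with a single task instance and still diversify over time, which mirrors the abstract's claim that ExIt can start from either a single or many task instances.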
Why it matters?
This research matters because it lets AI systems become more self-sufficient and adaptable. By learning to reliably improve their own work at inference time, these models can tackle more complex problems and reach higher performance without constant human intervention, with applications in areas such as competition math, multi-turn tool use, and automating parts of machine learning engineering.
Abstract
Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
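As a companion to the training sketch above, the following is a hedged illustration of what inference-time self-improvement over a step budget might look like. The names `generate_revision` and `evaluate` are again hypothetical stand-ins (with a slightly different sampler signature for clarity), and keeping the best-scoring attempt is one simple choice, assuming some task-specific score such as a validation metric is available; it is not the paper's prescribed procedure.

```python
# Illustrative inference-time self-improvement loop (not the paper's procedure).
def self_improve(problem, generate_revision, evaluate=None, budget=8):
    """Iteratively revise a solution for `budget` steps, conditioning each
    revision on the history of prior attempts."""
    history = []
    best, best_score = None, float("-inf")
    for _ in range(budget):
        attempt = generate_revision(problem, history)
        history.append(attempt)
        if evaluate is None:
            best = attempt                      # no scorer: keep the latest attempt
        else:
            score = evaluate(problem, attempt)  # e.g. a validation metric
            if score > best_score:
                best, best_score = attempt, score
    return best
```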