Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao
2025-09-09
Summary
This paper introduces BFS-Prover-V2, a system that uses large language models to automatically prove mathematical theorems, and tackles the twin challenges of training such systems effectively and making them efficient when actually searching for proofs.
What's the problem?
Using large language models for theorem proving is promising, but it runs into two major roadblocks. First, teaching the model through trial and error (reinforcement learning) takes a lot of computing power and often plateaus, meaning the model stops getting better. Second, actually *using* the model to find proofs can be incredibly slow and require a lot of computational resources, especially for complex theorems.
What's the solution?
The researchers addressed these problems with two key ideas. First, they developed a smarter way to train the language model using a system inspired by AlphaZero, which involves filtering the training data and periodically retraining the model to avoid those performance plateaus. Second, they created a 'planner' that breaks down difficult theorems into smaller, more manageable steps, and then uses a team of 'prover agents' working in parallel to solve those steps, sharing their progress in a common 'proof cache'. This division of labor makes the process much faster and more efficient.
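The "filtering the training data" idea can be made concrete with a small sketch. The snippet below is a toy illustration, not the paper's implementation: it keeps only (state, tactic) pairs whose empirical success rate falls in an intermediate band, so retraining focuses on steps at the model's frontier rather than on trivial or hopeless ones. The trajectory format and the band thresholds are assumptions for illustration.

```python
from collections import defaultdict

def filter_tactics(trajectories, min_rate=0.05, max_rate=0.95):
    """Adaptive tactic-level data filtering (illustrative sketch).

    Each trajectory is a list of (state, tactic, succeeded) triples.
    Keep only successful steps whose tactic has an empirical success
    rate strictly inside (min_rate, max_rate): tactics that always work
    teach the model nothing new, and tactics that never work add noise.
    """
    # First pass: tally per-tactic success statistics.
    stats = defaultdict(lambda: [0, 0])  # tactic -> [successes, attempts]
    for traj in trajectories:
        for _state, tactic, ok in traj:
            stats[tactic][1] += 1
            stats[tactic][0] += int(ok)

    # Second pass: keep successful steps in the intermediate-difficulty band.
    kept = []
    for traj in trajectories:
        for state, tactic, ok in traj:
            successes, attempts = stats[tactic]
            rate = successes / attempts
            if ok and min_rate < rate < max_rate:
                kept.append((state, tactic))
    return kept
```

In the paper's framing, repeated rounds of rollout, filtering, and periodic retraining (rather than continual fine-tuning on everything) are what keep the expert-iteration loop from plateauing.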
Why it matters?
This work is important because it significantly improves the ability of AI to automatically prove mathematical theorems, achieving top results on standard tests. But beyond math, the techniques used – specifically the improved training method and the planner-enhanced search – could be applied to other areas where complex reasoning and problem-solving are needed, like coding or scientific discovery.
Abstract
The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of an LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
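The planner-plus-parallel-provers architecture can be sketched in a few lines. Everything below is a toy stand-in under stated assumptions: the real system uses a reasoning LLM as planner and LLM step-provers searching in Lean, whereas here `plan` naively splits a conjunction-style goal, the "prover agent" always succeeds, and the shared proof cache is a thread-safe dictionary. The point is the division of labor: decomposition shrinks the search space, and the cache lets agents avoid re-proving subgoals another agent has already closed.

```python
import concurrent.futures
import threading

class ProofCache:
    """Thread-safe shared cache mapping subgoal -> completed proof."""
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def get(self, subgoal):
        with self._lock:
            return self._cache.get(subgoal)

    def put(self, subgoal, proof):
        with self._lock:
            self._cache[subgoal] = proof

def plan(theorem):
    # Stand-in for the high-level planner model: decompose the goal
    # into simpler subgoals (here, a naive split on "AND").
    return theorem.split(" AND ")

def prove_subgoal(subgoal, cache):
    # Consult the shared cache before searching.
    hit = cache.get(subgoal)
    if hit is not None:
        return hit  # another agent already closed this subgoal
    proof = f"proof({subgoal})"  # toy prover agent: always succeeds
    cache.put(subgoal, proof)
    return proof

def prove(theorem, max_workers=4):
    cache = ProofCache()
    subgoals = plan(theorem)
    # A team of prover agents works on the subgoals in parallel,
    # sharing progress through the proof cache.
    with concurrent.futures.ThreadPoolExecutor(max_workers) as pool:
        proofs = list(pool.map(lambda g: prove_subgoal(g, cache), subgoals))
    return dict(zip(subgoals, proofs))
```

In practice each agent would run its own best-first tactic search over Lean proof states, and the planner would iterate, re-decomposing subgoals that remain open.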