LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao

2026-03-24

Summary

This paper introduces LongCat-Flash-Prover, a 560-billion-parameter artificial intelligence model designed to formally prove mathematical theorems. It is released open-source, meaning anyone can use and study it, and it is built specifically to work with the formal proof language Lean4.

What's the problem?

Formally proving mathematical theorems is incredibly difficult for computers. It requires not just knowing math, but also translating informal ideas into a precise, logical format the computer can check (auto-formalization), creating a high-level plan for the proof (sketching), and then carrying out the proof step by step (proving). Existing AI models struggle with the length and complexity of these tasks, often getting stuck or making mistakes. Training such large models is also unstable and prone to reward hacking, where the AI finds 'cheats' that earn rewards without actually solving the problem.
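To make auto-formalization concrete, here is an illustrative (not from the paper) Lean4 formalization of the informal statement "the sum of two even numbers is even", assuming Mathlib's `Even` definition and tactics:

```lean
-- Illustrative example only: informal statement
-- "the sum of two even numbers is even", formalized in Lean4.
theorem even_add_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨m, hm⟩ := ha   -- a = m + m
  obtain ⟨n, hn⟩ := hb   -- b = n + n
  exact ⟨m + n, by rw [hm, hn]; ring⟩
```

The gap between the English sentence and this checkable statement is exactly what the auto-formalization capability must bridge; the proof body after `by` is what the proving capability must produce.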

What's the solution?

The researchers broke formal reasoning down into those three capabilities, auto-formalization, sketching, and proving, and trained a single model to handle all of them. The model uses a 'Mixture of Experts' architecture, which works like a team of specialized sub-networks that are selectively activated for each input. To train this complex model with reinforcement learning, they developed a method called 'Hierarchical Importance Sampling Policy Optimization' (HisPO), which stabilizes learning by masking out gradient updates from sequences and tokens that have drifted too far from the policy that generated them. They also added checks for theorem consistency and proof legality, which prevent the AI from earning rewards through unwanted 'cheats'. Essentially, they created a system that guides the AI to build correct and meaningful proofs.
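The paper does not publish HisPO's exact equations in this summary, but the core idea of two-level (sequence and token) importance-ratio masking can be sketched roughly. Everything below is a hypothetical illustration: the function name, the threshold values, and the use of a geometric-mean sequence ratio are assumptions, not the authors' implementation.

```python
import numpy as np

def hierarchical_is_mask(logp_train, logp_sample,
                         seq_ratio_max=2.0, tok_ratio_max=4.0):
    """Hypothetical sketch of hierarchical importance-sampling masking.

    logp_train, logp_sample: per-token log-probabilities of the sampled
    proof under the current training policy and under the (possibly stale)
    policy that generated it, both shape (seq_len,).

    Returns a 0/1 mask over tokens: the whole sequence is masked when its
    (geometric-mean) importance ratio is too far off-policy, and otherwise
    individual tokens with extreme ratios are masked.
    """
    log_diff = logp_train - logp_sample
    tok_ratio = np.exp(log_diff)
    # Sequence-level check: geometric mean of per-token ratios.
    seq_ratio = np.exp(np.mean(log_diff))
    if not (1.0 / seq_ratio_max <= seq_ratio <= seq_ratio_max):
        return np.zeros_like(tok_ratio)  # drop the whole sequence
    # Token-level check: drop only the outlier tokens.
    mask = ((tok_ratio >= 1.0 / tok_ratio_max) &
            (tok_ratio <= tok_ratio_max)).astype(float)
    return mask
```

In a policy-gradient loss, each token's contribution would simply be multiplied by this mask, so stale or inconsistent samples contribute no gradient rather than a destabilizing one.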

Why it matters?

This work represents a significant step forward in automated theorem proving. LongCat-Flash-Prover outperforms other open-weights models on several challenging benchmarks, reaching a 97.1% pass rate on MiniF2F-Test with a budget of only 72 attempts per problem, which shows it is not just accurate but also sample-efficient. This could eventually lead to AI tools that assist mathematicians in discovering new theorems and verifying existing ones, accelerating progress in mathematics and related fields.

Abstract

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.