EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty
Yuchen Tian, Ruiyuan Huang, Xuanwu Wang, Jing Ma, Zengfeng Huang, Ziyang Luo, Hongzhan Lin, Da Zheng, Lun Du
2025-10-07
Summary
This paper focuses on improving the ability of large language models (LLMs) to automatically prove mathematical theorems. Current LLMs are brittle: they often fail when a problem statement is worded slightly differently, and they don't perform consistently across different areas of mathematics.
What's the problem?
Large language models show promise in formal theorem proving, but they aren't very reliable. They often fail when the problem statement is slightly altered, meaning they lack 'generalizability' and are 'fragile'. Essentially, they can solve a problem in one form but fail on a semantically identical problem presented differently.
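To make this fragility concrete, here is a hypothetical illustration (not taken from the paper's benchmarks) of two Lean statements of the same inequality. A prover may find a proof for one phrasing yet fail on the other, even though they are equivalent up to multiplying both sides by 2:

```lean
-- Two phrasings of the same fact; the statements are equivalent,
-- but a fragile prover may solve only one of them.
theorem v1 (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by sorry

theorem v2 (a b : ℝ) : 2 * (a * b) ≤ a ^ 2 + b ^ 2 := by sorry
```

Robustness to exactly this kind of surface-level rewriting is what the paper's symmetry-based augmentation targets.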
What's the solution?
The researchers created a system called EvolProver by first building a better training dataset. They did this through 'data augmentation': generating many variations of existing theorems. They used three main techniques: EvolAST, an Abstract Syntax Tree (AST)-based method that rewrites a theorem's structure while preserving its meaning; EvolDomain, which uses LLMs to translate theorems between different areas of mathematics; and EvolDifficulty, which uses evolutionary instructions to generate new theorems spanning a wider range of difficulty. They then trained a 7-billion-parameter LLM, EvolProver, on this expanded dataset.
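The paper's EvolAST operates on the syntax trees of formal theorem statements; as a toy analogue (not the authors' implementation), the sketch below applies a meaning-preserving rewrite, commutativity of addition, to a Python expression's AST to produce a syntactically different but semantically equivalent variant:

```python
import ast

class CommuteAdd(ast.NodeTransformer):
    """Swap the operands of every addition: a + b -> b + a."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite children first
        if isinstance(node.op, ast.Add):
            node.left, node.right = node.right, node.left
        return node

def equivalent_variant(expr: str) -> str:
    """Return a semantically equivalent rewrite of `expr`."""
    tree = ast.parse(expr, mode="eval")
    new_tree = ast.fix_missing_locations(CommuteAdd().visit(tree))
    return ast.unparse(new_tree)

original = "(x + 1) * (y + 2)"
variant = equivalent_variant(original)
print(variant)  # -> (1 + x) * (2 + y)

# Sanity check: both forms evaluate identically on sample inputs.
env = {"x": 3, "y": 5}
assert eval(original, dict(env)) == eval(variant, dict(env))
```

A real EvolAST-style pipeline would apply such equivalence-preserving rules (commutativity, associativity, definitional unfolding, and so on) to Lean statements rather than Python expressions, but the principle is the same: the variant exercises different syntax while the proof obligation is unchanged.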
Why it matters?
EvolProver achieved state-of-the-art results on several challenging theorem-proving benchmarks, outperforming other models of similar size, including reasoning-based models. This demonstrates that a carefully constructed training dataset can significantly improve LLM performance in this area, potentially leading to more reliable and powerful automated theorem-proving tools. It suggests that focusing on data quality and diversity can be more effective than simply increasing model size.
Abstract
Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To address this limitation, we introduce a novel data augmentation pipeline designed to enhance model robustness from two perspectives: symmetry and difficulty. From the symmetry perspective, we propose two complementary methods: EvolAST, an Abstract Syntax Tree (AST) based approach that targets syntactic symmetry to generate semantically equivalent problem variants, and EvolDomain, which leverages LLMs to address semantic symmetry by translating theorems across mathematical domains. From the difficulty perspective, we propose EvolDifficulty, which uses carefully designed evolutionary instructions to guide LLMs in generating new theorems with a wider range of difficulty. We then use the evolved data to train EvolProver, a 7B-parameter non-reasoning theorem prover. EvolProver establishes a new state-of-the-art (SOTA) on FormalMATH-Lite with a 53.8% pass@32 rate, surpassing all models of comparable size, including reasoning-based models. It also sets new SOTA records for non-reasoning models on MiniF2F-Test (69.8% pass@32), Ineq-Comp-Seed (52.2% pass@32), and Ineq-Comp-Transformed (34.0% pass@32). Ablation studies further confirm our data augmentation pipeline's effectiveness across multiple benchmarks.