TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang

2024-07-10

Summary

This paper introduces TheoremLlama, a system designed to turn general-purpose large language models (LLMs) into experts at Lean4, a formal language for writing mathematical proofs that a computer can verify. The aim is to improve how reliably these models generate complete and accurate formal proofs.

What's the problem?

The main problem is that while LLMs are good at understanding and generating natural language, they struggle with formal theorem proving because there is very little training data that pairs natural-language proofs with their formal counterparts in Lean4. This scarcity limits their ability to produce correct formal proofs and makes it hard to train them effectively (a sketch of what one such aligned pair looks like is shown below).
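
To make "aligned NL-FL data" concrete, here is a minimal hypothetical example of such a pair: an informal statement and proof in English, attached as comments to a Lean4 formalization. The theorem name `even_add_even` and the particular Mathlib-style proof are our own illustration, not taken from the paper or its dataset.

```lean
import Mathlib

-- NL statement: "The sum of two even natural numbers is even."
-- NL proof: "Write a = x + x and b = y + y;
--            then a + b = (x + y) + (x + y)."
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨x, hx⟩ := ha  -- a = x + x
  obtain ⟨y, hy⟩ := hb  -- b = y + y
  exact ⟨x + y, by rw [hx, hy]; ring⟩
```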

What's the solution?

To address this issue, the authors propose TheoremLlama, an end-to-end framework that trains a general-purpose LLM to become proficient in Lean4. The framework combines methods for generating datasets that align natural-language proofs with formal ones (released as the Open Bootstrapped Theorems, or OBT, dataset), training techniques tailored to theorem proving, and strategies for writing Lean4 proofs with the trained model. One of the key innovations is the NL-FL bootstrapping method, which embeds natural-language proofs directly into the Lean4 code used for training, so the model can leverage its natural-language reasoning when constructing formal proofs. A sketch of what such a bootstrapped example might look like follows below.
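
The following is a minimal sketch of a bootstrapped training example in the style described above: the natural-language proof is interleaved as comments with the Lean4 tactic steps, so a single training example exposes the model to both modalities. The theorem name `two_mul_le_sq_add_sq` and the proof are hypothetical illustrations, not taken from the OBT dataset.

```lean
import Mathlib

-- Hypothetical bootstrapped example: NL proof steps interleaved
-- as comments with the formal Lean4 tactics.
theorem two_mul_le_sq_add_sq (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  -- NL: "Observe that the square (a - b)^2 is nonnegative."
  have h : 0 ≤ (a - b) ^ 2 := sq_nonneg (a - b)
  -- NL: "Expanding (a - b)^2 = a^2 - 2ab + b^2 and rearranging
  --      yields the claim."
  nlinarith [h]
```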

Why it matters?

This research is important because it enhances the ability of AI systems to perform formal reasoning and theorem proving, which are critical in mathematics and computer science. By improving how LLMs generate machine-verifiable proofs, TheoremLlama opens new possibilities for automated reasoning, making it easier to tackle complex mathematical problems and potentially advancing the field of mathematics itself.

Abstract

Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. Similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address the challenges, this paper proposes **TheoremLlama**, an end-to-end framework to train a general-purpose LLM to become a Lean4 expert. This framework encompasses NL-FL aligned dataset generation methods, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation in this framework is the NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leveraging the NL reasoning ability of LLMs for formal reasoning. The **TheoremLlama** framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. We have also open-sourced our model checkpoints and generated dataset, and will soon make all the code publicly available.