
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen, Ge Zhang, Weiyang Liu

2025-05-06

Summary

This paper introduces FormalMATH, a new way to test how well large language models handle formal math problems. It combines a large collection of math questions written in the Lean4 proof language with an automatic pipeline that turns ordinary math statements into a format that computers can check.

What's the problem?

Writing out math problems in a form that computers can use for formal proofs is hard and time-consuming for people, and current AI models still struggle to solve these kinds of problems accurately.

What's the solution?

The researchers created a large set of formal math problems in Lean4 and built an autoformalization tool that automatically translates ordinary math statements into this computer-checkable language, making it easier to test and improve AI's math skills. A sketch of what such a formalized statement can look like appears below.
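
For intuition, here is a minimal sketch, assuming a Mathlib-based Lean4 setup, of the kind of statement and proof involved. The theorem name and the problem itself are illustrative examples and are not drawn from the FormalMATH benchmark.

```lean
import Mathlib

-- Hypothetical example: a Lean4 formalization of the informal problem
-- "show that the sum of two even integers is even".
theorem sum_of_two_evens_is_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  -- Unfold the evenness hypotheses into witnesses: a = m + m and b = n + n.
  obtain ⟨m, hm⟩ := ha
  obtain ⟨n, hn⟩ := hb
  -- Exhibit m + n as a witness that a + b is even.
  exact ⟨m + n, by rw [hm, hn]; ring⟩
```

In a benchmark like this, the formal statements are what the autoformalization pipeline produces, while proofs such as the one above are what the evaluated LLM-based theorem provers must supply.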

Why it matters?

This matters because it helps scientists see where AI needs to get better at real math reasoning, which is important for everything from science and engineering to making sure AI can help with complex problem-solving in the future.

Abstract

FormalMATH is a large-scale Lean4 benchmark built with an autoformalization pipeline that reduces manual annotation costs; evaluations on it reveal the limitations of existing LLM-based theorem provers in formal mathematical reasoning.