miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh

2025-11-17

Summary

This paper investigates how well AI systems can solve math problems like those found in math competitions, specifically focusing on a benchmark called miniF2F. The AI needs to understand the problem written in plain English, translate it into a formal mathematical language called Lean, and then actually prove the problem to get credit.
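To make the pipeline concrete, here is a hypothetical example of the kind of statement involved (not taken from the paper or from miniF2F): an informal competition-style claim and one possible Lean 4 / Mathlib formalization with a verified proof. The theorem name and proof are illustrative assumptions, sketched to show what "formalize, then prove" means.

```lean
-- Informal statement (English): "Show that for every natural
-- number n, the quantity n² + n is even."
--
-- One possible formalization and proof in Lean 4 with Mathlib.
-- `even_mul_succ_self` is the Mathlib lemma that n * (n + 1) is even.
theorem even_sq_add_self (n : ℕ) : Even (n ^ 2 + n) := by
  -- Rewrite n² + n as n * (n + 1) so the library lemma applies.
  have h : n ^ 2 + n = n * (n + 1) := by ring
  rw [h]
  exact even_mul_succ_self n
```

In the end-to-end setting the paper studies, the model receives only the English sentence, must produce a Lean statement like the one above, and then must also produce the proof; it gets credit only if the formal statement faithfully matches the informal one.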

What's the problem?

Current AI models aren't very good at solving these types of math problems end-to-end. While they perform well at either translating English to Lean or proving theorems *separately*, their overall accuracy when doing both in sequence drops to about 36%, far below the roughly 97% (autoformalization) and 69% (theorem proving) reported for each step on its own. The researchers found that a big part of this gap isn't necessarily the AI's reasoning ability, but rather inconsistencies and errors between the way the problems are stated informally (in English) and formally (in Lean) within the miniF2F benchmark itself, affecting more than half of its problems.

What's the solution?

The researchers carefully went through all the problems in miniF2F and corrected errors, inconsistencies, and simplifications in both the English statements and the Lean formalizations. They created a new, improved version of the benchmark called miniF2F-v2, in which the informal and formal statements fully match and every statement has a verified proof. When they retested the AI models on this new benchmark, end-to-end accuracy rose substantially, reaching 70%.

Why it matters?

This work highlights the importance of high-quality benchmarks for evaluating AI systems that deal with formal reasoning, like proving mathematical theorems. If the benchmark itself is flawed, it's hard to tell if the AI is failing because it can't reason well, or because it's being given incorrect or ambiguous information. A better benchmark allows researchers to more accurately assess progress and identify where AI models still struggle.

Abstract

We perform a thorough analysis of the formal and informal statements in the miniF2F benchmark from the perspective of an AI system that is tasked to participate in a math Olympiad consisting of the problems in miniF2F. In such a setting, the model has to read and comprehend the problems in natural language, formalize them in the Lean language, then proceed with proving the problems, and it will get credit for each problem if the formal proof corresponds to the original informal statement presented to the model. Our evaluation results reveal that the best accuracy of such a pipeline is about 36% using the SoTA models in the literature, considerably lower than the individual SoTA accuracies of 97% and 69% reported in the autoformalization and theorem proving literature, respectively. Analyzing the failure modes, we trace a considerable portion of this drop to discrepancies between the formal and informal statements for more than half of the problems in miniF2F. We proceed with correcting all the errors, discrepancies, and simplifications in the formal and informal statements, and present miniF2F-v2 with fully verified formal and informal statements and proofs. Evaluating the full theorem proving pipeline on miniF2F-v2 yields a best accuracy of 70%, a significant improvement over the 40% on the original miniF2F, yet still indicating considerable misalignment between autoformalization models and theorem provers. Our deep analysis suggests that a higher-quality benchmark can help the community better evaluate progress in the field of formal reasoning and also better diagnose the failure and success modes of autoformalization and theorem proving models. Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2.