Towards Robust Mathematical Reasoning

Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung

2025-11-04

Summary

This paper introduces a new set of challenging math problems, called IMO-Bench, designed to rigorously test how well artificial intelligence can perform mathematical reasoning, going beyond simply checking whether the final answer is right.

What's the problem?

Current ways of evaluating AI's math skills are either too simple or only check if the AI can give the correct final answer to a problem, without looking at the steps it took to get there. This doesn't accurately measure true mathematical understanding or the ability to solve complex problems like those found in the International Mathematical Olympiad (IMO), a very prestigious math competition for high school students.

What's the solution?

The researchers created IMO-Bench, which has two parts: IMO-AnswerBench, which tests for correct short answers on 400 Olympiad problems, and IMO-Proof Bench, which assesses the AI's ability to write out complete mathematical proofs. They also developed autograders to score these proofs automatically, and used the benchmarks to improve their own AI model, Gemini Deep Think, which went on to achieve gold-medal-level performance at IMO 2025. Finally, they released IMO-GradingBench, a dataset of 1,000 human-graded proofs, to help improve automatic grading systems.
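The paper's autograders are built with Gemini reasoning models following detailed, per-problem grading guidelines; the sketch below only illustrates the general rubric idea in runnable form. The `RubricItem` type, the keyword check standing in for an LLM judgment, and the point values are all hypothetical, not the paper's actual scheme.

```python
# Illustrative sketch only: a rubric assigns partial credit to proof steps,
# and a proof's score is the fraction of rubric points it earns.
# The keyword match below is a hypothetical stand-in for the LLM-based
# step verification the paper actually uses.
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str  # grading guideline, e.g. "establishes the base case"
    points: int       # partial credit for this step
    keyword: str      # crude stand-in for a model's judgment of the step

def grade_proof(proof: str, rubric: list[RubricItem]) -> float:
    """Return the fraction of rubric points the proof earns."""
    earned = sum(item.points for item in rubric if item.keyword in proof.lower())
    total = sum(item.points for item in rubric)
    return earned / total

rubric = [
    RubricItem("states and proves the base case", 2, "base case"),
    RubricItem("completes the inductive step", 4, "inductive step"),
    RubricItem("concludes the claim for all n", 1, "for all n"),
]
proof = ("We verify the base case n=1, then the inductive step "
         "gives the claim for all n.")
print(grade_proof(proof, rubric))  # 1.0
```

The appeal of rubric-based grading is that it rewards correct intermediate reasoning rather than only the final verdict, which is what makes automatic evaluation of long-form proofs tractable.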

Why it matters?

Developing AI that can truly reason mathematically is important for many fields. These benchmarks provide a more rigorous way to measure progress in this area and push AI development towards more robust and human-like mathematical problem-solving skills. The automatic grading tools also help speed up the process of evaluating and improving these AI systems.

Abstract

Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists, that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO-level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%, respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations, and we constructed IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in the automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning, and we release it at https://imobench.github.io/.