A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, Matthias Bethge
2025-04-10
Summary
This paper shows that AI models which solve math problems are often tested in ways that make their progress look better than it really is, because small changes in how they're tested can lead to big differences in results.
What's the problem?
When researchers test AI models on math problems, tiny tweaks (like how the questions are worded, which random seed is used, or the hardware and software setup) can cause wildly different scores, making it hard to tell whether a model is actually improving or just getting lucky.
What's the solution?
The authors build a fairer testing system with clearly defined rules and share all the tools needed to reproduce the results. Under this stricter evaluation, simpler training methods (supervised finetuning) often generalize better than flashier reinforcement-learning approaches whose claimed improvements don't hold up.
Why does it matter?
This makes AI research more honest and reliable, so future models can be built on solid evidence instead of hype, leading to better tutors, calculators, and problem-solving tools.
Abstract
Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices - including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations. Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance. To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards. Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements - far below prior claims - and are prone to overfitting, especially on small-scale benchmarks like AIME24. In contrast, supervised finetuning (SFT) methods show consistently stronger generalization. To foster reproducibility, we release all code, prompts, and model outputs for reasoning benchmarks, establishing more rigorous foundations for future work.
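To make the variance problem concrete: on a 30-question benchmark like AIME24, a single extra correct answer moves accuracy by over 3 percentage points, so single-run scores can easily overstate a method's gain. A minimal sketch of the kind of multi-seed reporting the paper advocates is below; the accuracy numbers are invented for illustration and this is not the paper's released code.

```python
import statistics

# Hypothetical per-seed accuracies on a 30-problem benchmark.
# One extra correct answer shifts accuracy by ~3.3 points, so
# run-to-run spread can dwarf a claimed improvement.
seed_accuracies = [0.433, 0.467, 0.400, 0.500, 0.433]

mean_acc = statistics.mean(seed_accuracies)
std_acc = statistics.stdev(seed_accuracies)  # sample standard deviation

# Report mean and spread across seeds instead of a single run,
# so small gains can be distinguished from noise.
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

With these illustrative numbers the spread across seeds is several points wide, which is exactly why the paper argues that unreported variance can make modest RL gains look like large ones.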