Large Language Models and Mathematical Reasoning Failures
Johan Boye, Birger Moell
2025-02-18
Summary
This paper examines how well large language models (LLMs) can solve math problems and where they struggle. The researchers created a set of 50 high-school math word problems to test different AI models and looked closely at how each model tried to solve the problems, not just whether it got the right answer.
What's the problem?
Current ways of testing AI's math skills usually only care about getting the right answer, which doesn't show whether the AI actually understands the problem. Also, even the best AI models make mistakes in basic math and logical thinking, which is a big issue if we want to use them for real-world problem-solving.
What's the solution?
The researchers tested eight top AI models on their new set of math problems. They carefully examined both the final answers and the solution steps each model produced. This helped them identify common mistakes the models made, like making unwarranted assumptions, relying too much on number patterns, and having trouble turning real-world situations into math problems.
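To make the setup concrete, here is a minimal sketch of what such an evaluation loop could look like. This is not the authors' code: the problem file (`problems.json`), the `query_model` placeholder, and the naive answer-extraction heuristic are all hypothetical, and the paper's step-by-step error analysis was done manually rather than automatically.

```python
import json
import re


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to whichever chat-completion API hosts `model_name`.

    A real harness would wrap the provider's SDK here; it is left abstract
    because the paper does not specify the exact setup.
    """
    raise NotImplementedError


def extract_final_answer(solution_text: str) -> str:
    """Naive heuristic: take the last number mentioned in the model's solution."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_text)
    return numbers[-1] if numbers else ""


def evaluate(models, problems):
    """Record both the final answer and the full solution text for each (model, problem) pair."""
    records = []
    for model in models:
        for problem in problems:
            solution = query_model(model, problem["question"])
            answer = extract_final_answer(solution)
            records.append({
                "model": model,
                "question": problem["question"],
                "reference_answer": str(problem["answer"]),
                "model_solution": solution,  # kept verbatim for later step-by-step review
                "model_answer": answer,
                "answer_correct": answer == str(problem["answer"]),
            })
    return records


if __name__ == "__main__":
    # Hypothetical file format: [{"question": "...", "answer": ...}, ...]
    with open("problems.json") as f:
        problems = json.load(f)
    models = ["gpt-4o", "o1", "o3-mini", "deepseek-r1"]  # subset of the models named in the paper
    results = evaluate(models, problems)
    for model in models:
        correct = sum(r["answer_correct"] for r in results if r["model"] == model)
        print(f"{model}: {correct}/{len(problems)} final answers correct")
```

The point of keeping `model_solution` verbatim, rather than only the extracted answer, is that it allows the kind of manual categorization of failure modes (unwarranted assumptions, pattern matching, flawed translation of the scenario into equations) that the paper emphasizes.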
Why does it matter?
This research matters because it shows that even the smartest AI still has trouble with the kind of thinking we use to solve everyday math problems. It warns us not to trust AI too much for problem-solving yet and points out where AI needs to improve. This could help make future AI systems better at logical thinking and more reliable for tasks that need careful reasoning.
Abstract
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models - including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants - we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite possessing broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.