
Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges

Safal Shrestha, Minwu Kim, Keith Ross

2025-02-14

Summary

This paper introduces a new way to test how well AI language models can solve math problems with really big or small numbers, and a way to better understand where these models make mistakes in their reasoning.

What's the problem?

Current tests for AI math skills use problems with simple numbers, so they don't show how well the AI can handle real-world math with very large or small numbers. Also, these tests only check whether the final answer is right, not how the AI got there, which means we can't tell whether the AI is actually reasoning correctly or just getting lucky.

What's the solution?

The researchers created GSM-Ranges, a tool that takes existing math problems (from the GSM8K benchmark) and changes the numbers to be much bigger or smaller. They also came up with a new way to grade the AI's work that checks whether mistakes come from wrong thinking (logical errors) or just from calculation slips. They then tested different AI models using these new problems and the new grading system.
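To make the number-perturbation idea concrete, here is a minimal sketch of how values in a templated GSM8K-style word problem could be swapped for numbers of different magnitudes. This is an illustration only, assuming a simple placeholder template and a digit-count "scale" parameter; it is not the authors' GSM-Ranges implementation.

```python
# Illustrative sketch only: a toy version of the "perturb the numbers" idea.
# The template format, scale levels, and sampling rule here are assumptions,
# not the actual GSM-Ranges generator.
import random

def perturb_problem(template: str, base_values: dict[str, int], scale: int) -> tuple[str, dict[str, int]]:
    """Replace each placeholder value with a random number of roughly `scale` digits."""
    new_values = {
        name: random.randint(10 ** (scale - 1), 10 ** scale - 1)
        for name in base_values
    }
    return template.format(**new_values), new_values

# A GSM8K-style word problem rewritten as a template (hypothetical example).
template = ("A farmer has {hens} hens and each hen lays {eggs} eggs per week. "
            "How many eggs does the farmer collect in {weeks} weeks?")
base = {"hens": 4, "eggs": 3, "weeks": 2}

for scale in (1, 4, 7):  # small, medium, and very large numbers
    problem, values = perturb_problem(template, base, scale)
    answer = values["hens"] * values["eggs"] * values["weeks"]
    print(problem, "->", answer)
```

Because the ground-truth answer is recomputed from the sampled values, the underlying reasoning stays the same at every scale; only the arithmetic gets harder, which is exactly the property such a benchmark needs.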

Why it matters?

This matters because it helps us understand where AI models struggle with math, especially with big numbers or complex word problems. By finding these weaknesses, we can make better AI that can handle more realistic math problems. This could lead to AI that's more useful for real-world tasks that involve numbers and calculations, like finance or engineering.

Abstract

Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, failing to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates, up to 14 percentage points, as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs' mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.
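One way to picture the logical vs. non-logical distinction from the abstract is to re-run the model's own solution plan with exact arithmetic: if the corrected computation reaches the ground truth, only the calculation was wrong; if not, the reasoning itself was flawed. The sketch below is a hypothetical illustration of that idea, not the paper's actual grading pipeline, and the `reasoning_expr` extraction step is assumed.

```python
# Illustrative sketch of separating logical from non-logical (computational) errors.
# Assumption: the model's solution plan has already been distilled into an
# arithmetic expression; how that extraction works is not shown here.
def classify_error(model_answer: float, reasoning_expr: str, ground_truth: float) -> str:
    """Re-evaluate the model's own reasoning with exact arithmetic."""
    if model_answer == ground_truth:
        return "correct"
    # eval is fine for this toy sketch with trusted, hand-written expressions only
    recomputed = eval(reasoning_expr)
    if recomputed == ground_truth:
        return "non-logical (computational) error"  # the plan was right, the arithmetic was not
    return "logical error"  # even flawless arithmetic cannot rescue the plan

print(classify_error(23, "4 * 3 * 2", 24))  # -> non-logical (computational) error
print(classify_error(14, "4 * 3 + 2", 24))  # -> logical error
```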