GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar

2024-10-08

Summary

This paper introduces GSM-Symbolic, a new benchmark designed to better evaluate the mathematical reasoning abilities of large language models (LLMs) and to understand their limitations in solving math problems.

What's the problem?

While LLMs have posted steadily better scores on the GSM8K benchmark of grade-school math questions, there are concerns about whether these gains reflect genuine reasoning ability. Many models appear to recall solution patterns from their training data rather than truly understand the underlying math, which leads to inconsistent performance when questions are even slightly altered and casts doubt on the reliability of their reported capabilities.

What's the solution?

To address these issues, the authors created GSM-Symbolic, which uses symbolic templates to generate many variants of each math question, enabling more controlled evaluations of LLMs' reasoning skills. The study found that even small changes, such as altering only the numerical values in a question, caused models' performance to drop significantly. Performance deteriorated further as questions gained additional clauses, suggesting that models rely on patterns learned from their training data rather than genuine logical reasoning. A minimal sketch of the templating idea appears below.
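The paper's actual templates are not reproduced here; the following is a minimal, hypothetical sketch of how a symbolic template can be instantiated into many numerically different variants of the same question, with the ground-truth answer recomputed each time. The template wording, variable names, and sampling ranges are all assumptions for illustration.

```python
import random

# Hypothetical GSM8K-style template with placeholders for a name and numbers.
# The real GSM-Symbolic templates also carry constraints on the values; here
# we only require that the final answer stays non-negative.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gives {z} apples to a friend. "
    "How many apples does {name} have left?"
)

def answer(x, y, z):
    # Ground-truth answer, recomputed for every instantiation.
    return x + y - z

def instantiate(seed):
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Mia", "Noah"])
    x, y = rng.randint(5, 40), rng.randint(5, 40)
    z = rng.randint(1, x + y)  # keep the result non-negative
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    return question, answer(x, y, z)

# Generate several numerically different variants of the "same" question;
# a model's accuracy across such variants measures its sensitivity to
# surface changes that leave the underlying reasoning unchanged.
for seed in range(3):
    question, gold = instantiate(seed)
    print(question, "->", gold)
```

Evaluating a model on many such instantiations, rather than on a single fixed question, is what lets GSM-Symbolic measure the variance in accuracy that the paper reports.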

Why it matters?

This research is important because it highlights the limitations of current LLMs in mathematical reasoning and provides a more nuanced understanding of how these models operate. By using GSM-Symbolic for evaluation, researchers can better assess the strengths and weaknesses of LLMs, which could lead to improvements in their design and training. This is crucial for applications that rely on accurate mathematical reasoning, such as education and scientific research.

Abstract

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
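To make the "irrelevant clause" finding concrete, here is a hypothetical illustration (the wording is invented, not taken from the paper): a single added sentence introduces a number that sounds relevant but contributes nothing to the reasoning chain needed for the answer.

```python
# Hypothetical illustration (wording invented, not taken from the paper) of a
# "seemingly relevant but irrelevant" clause added to a grade-school problem.
base_question = (
    "A library receives 35 new books in March and 48 new books in April. "
    "How many new books does the library receive in total?"
)

# The added clause mentions a number that plays no role in the correct answer.
distractor = "Seven of the books received in April have damaged covers. "

perturbed_question = (
    "A library receives 35 new books in March and 48 new books in April. "
    + distractor
    + "How many new books does the library receive in total?"
)

# The ground truth is 35 + 48 = 83 for both versions. The paper reports that a
# single clause like this can cut accuracy by up to 65% on state-of-the-art
# models, typically because the model folds the irrelevant number into the
# computation (e.g., answering 83 - 7 = 76).
print(perturbed_question, "->", 35 + 48)
```

Comparing accuracy on the base and perturbed versions isolates whether a model is following the reasoning chain or pattern-matching on surface cues.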