
Not All LLM Reasoners Are Created Equal

Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, Rishabh Agarwal

2024-10-03

Summary

This paper examines how well large language models (LLMs) can solve grade-school math problems, and it highlights significant differences in their reasoning abilities when problems are chained together so that one answer feeds into the next.

What's the problem?

Many LLMs can generate correct answers to individual math problems, but they often struggle when questions must be solved in sequence. Their performance can vary greatly depending on whether they answer questions independently or as part of a chain where one answer feeds into the next. This inconsistency can lead to unexpected errors, and it is especially pronounced in smaller or specialized models.

What's the solution?

The researchers tested various LLMs on pairs of math problems in which the second problem depends on the answer to the first. They found that most models performed worse on these chained pairs than on the same questions asked separately. To understand the gap, they analyzed how different training recipes, such as instruction tuning, code generation, and finetuning on GSM data, affect reasoning across model sizes, and found that smaller, cheaper, and math-specialized models struggle the most with chained problems. They also found that distraction from the extra context and weak second-hop reasoning, rather than test-set leakage, account for most of the performance gap.
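
To make the evaluation setup concrete, here is a minimal sketch of how such compositional pairs could be built and scored. Everything in it is an illustrative assumption rather than the paper's actual code: the prompt format, the helper names (build_compositional_pair, accuracy, reasoning_gap), and the particular gap formula, which is just one plausible reading of "performance difference."

```python
# Minimal sketch of a compositional GSM-style evaluation (illustration only;
# the helper names and prompt format below are hypothetical, not the authors' code).

def build_compositional_pair(q1: str, q2_with_placeholder: str) -> str:
    """Chain two word problems: the answer to Problem 1 ("X") is a quantity
    needed to solve Problem 2."""
    return (
        f"Problem 1: {q1}\n"
        f"Problem 2: {q2_with_placeholder}\n"
        "Let X be the answer to Problem 1. Solve Problem 1, substitute X into "
        "Problem 2, and report only the final answer to Problem 2."
    )

def accuracy(generate, prompts, gold_answers, parse_answer):
    """Fraction of prompts whose parsed final answer matches the gold answer.
    `generate` and `parse_answer` stand in for a model API call and an answer
    extractor (e.g. a regex that grabs the last number in the reply)."""
    correct = sum(
        parse_answer(generate(prompt)) == gold
        for prompt, gold in zip(prompts, gold_answers)
    )
    return correct / len(prompts)

def reasoning_gap(acc_q1: float, acc_q2: float, acc_compositional: float) -> float:
    """One simple reading of the gap (the paper's exact formula may differ):
    the accuracy expected if the two hops were independent (acc_q1 * acc_q2)
    minus the measured accuracy on the chained pairs."""
    return acc_q1 * acc_q2 - acc_compositional
```

Under this reading, a positive gap means the model solves the chained pairs less often than its accuracies on the individual questions would predict.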

Why it matters?

This research is important because it shows that LLMs with similar benchmark scores are not equally capable of reasoning through multi-step math problems, which affects their reliability in educational settings and other applications. Understanding these differences can help developers build and choose models whose real problem-solving ability matches what standard benchmarks suggest.

Abstract

We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is, a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not because of test-set leakage, but due to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.