MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs
Lei Wang, Shan Dong, Yuhui Xu, Hanze Dong, Yalu Wang, Amrita Saha, Ee-Peng Lim, Caiming Xiong, Doyen Sahoo
2024-10-08

Summary
This paper introduces MathHay, an automated benchmark designed to evaluate how well large language models (LLMs) can solve complex mathematical reasoning tasks that require locating and reasoning over information buried in long texts.
What's the problem?
While LLMs have shown they can work with long pieces of text, there hasn't been a specific benchmark to test their ability to solve math problems that involve multiple steps and require pulling information from lengthy contexts. This is important because many real-world applications need models that can understand and reason through complex mathematical scenarios, but existing benchmarks do not adequately assess this capability.
What's the solution?
To fill this gap, the authors created MathHay, an automatically constructed benchmark of challenging math problems that require LLMs not only to locate the relevant information within long documents but also to perform multi-step reasoning over it. The benchmark includes task types of varying difficulty, allowing researchers to measure how well models integrate information drawn from extended contexts. The authors evaluated eight top-performing LLMs on MathHay and found that even the best model, Gemini-1.5-Pro-002, struggled, reaching only 51.26% accuracy at a context length of 128K tokens.
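This summary does not spell out the benchmark's construction pipeline, so the following is only a minimal sketch of what a long-context math-reasoning evaluation of this general shape could look like: a few numeric facts needed for a multi-step question are scattered inside a long haystack of distractor documents, the model is prompted with the full context plus the question, and its final number is checked against the gold answer. The function names, the word-count token proxy, and the scoring tolerance are illustrative assumptions, not details taken from MathHay itself.

```python
import random
import re


def build_instance(facts, question, distractor_docs, target_tokens=128_000):
    """Embed the facts needed for a multi-step math question inside a long
    haystack of irrelevant documents, up to roughly target_tokens in length."""
    haystack, token_count = [], 0
    for doc in distractor_docs:
        haystack.append(doc)
        token_count += len(doc.split())  # crude proxy for token count
        if token_count >= target_tokens:
            break
    # Scatter the relevant facts at random positions within the haystack.
    for fact in facts:
        haystack.insert(random.randrange(len(haystack) + 1), fact)
    context = "\n\n".join(haystack)
    return (
        f"{context}\n\nQuestion: {question}\n"
        "Think step by step and give the final numeric answer."
    )


def is_correct(model_output: str, gold: float, tol: float = 1e-2) -> bool:
    """Score by comparing the last number in the model's output to the gold answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return bool(nums) and abs(float(nums[-1]) - gold) <= tol


# Hypothetical usage: the facts, question, and gold answer are invented examples.
facts = [
    "In Q1, Acme Corp reported revenue of $120 million.",
    "In Q2, Acme Corp's revenue grew by 15% over Q1.",
]
prompt = build_instance(
    facts,
    question="What was Acme Corp's Q2 revenue in millions of dollars?",
    distractor_docs=["Unrelated filler paragraph."] * 50_000,
)
# accuracy = mean(is_correct(call_llm(prompt), 138.0) over all benchmark instances)
```

Under this framing, a model must both retrieve the scattered facts (the information-seeking step) and combine them arithmetically (120 × 1.15 = 138), which is the two-part capability the benchmark is designed to stress.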
Why it matters?
This research is important because it highlights the limitations of current LLMs in performing mathematical reasoning over long texts. By establishing MathHay as a benchmark, it provides a way for researchers to better understand the strengths and weaknesses of these models, which can lead to improvements in their design. Ultimately, enhancing LLMs' ability to tackle complex math problems will be beneficial for various fields, including education, engineering, and data analysis.
Abstract
Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks like Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay demands models with both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.