
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

2024-06-17

Summary

This paper introduces BABILong, a new benchmark designed to evaluate how well large language models (LLMs) can reason with information spread across very long documents. It aims to test the models' ability to understand and connect facts that are not all in one place.

What's the problem?

As LLMs have advanced, the amount of text they can accept as input has grown dramatically. However, the benchmarks used to test these models haven't kept pace, so it is hard to know how well they actually handle long documents. In particular, most existing evaluations do not measure whether a model can find and combine pieces of information scattered throughout a lengthy text, an ability that matters for real-world applications like summarization and question answering over large documents.

What's the solution?

To address this issue, the authors created the BABILong benchmark, which includes 20 reasoning tasks that require models to connect facts spread across long stretches of text. The tasks cover skills such as chaining facts together, simple induction and deduction, counting, and handling lists and sets. Each sample hides the few sentences needed to answer the question inside long passages of unrelated natural text (the "haystack"), and the contexts can be extended to arbitrary lengths. This lets researchers measure how well LLMs cope when the relevant details are scattered through a very large amount of surrounding text, as the sketch below illustrates.
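To make the "haystack" construction concrete, here is a minimal sketch (not the authors' actual generator) of how a BABILong-style sample could be assembled: a short, ordered chain of task-relevant facts is scattered, in order, through filler text drawn from an unrelated corpus. The function and field names (build_long_context_sample, context, question, answer) are illustrative assumptions.

```python
# Minimal sketch, not the authors' generator: scatter an ordered chain of
# task-relevant facts ("needles") through unrelated filler text ("haystack").
import random


def build_long_context_sample(facts, question, answer,
                              distractor_sentences, target_words=4000):
    """Return a dict with a long context plus a question whose answer
    requires connecting the scattered facts.

    facts: ordered sentences the model must combine (order is preserved)
    distractor_sentences: background sentences from an unrelated corpus
    target_words: rough context size, approximated here by word count
    """
    haystack, words = [], 0
    while words < target_words:
        sentence = random.choice(distractor_sentences)
        haystack.append(sentence)
        words += len(sentence.split())

    # Insert facts at random positions while keeping their original order,
    # so multi-hop tasks (e.g. fact chaining) remain answerable.
    positions = sorted(random.sample(range(len(haystack) + 1), len(facts)))
    for offset, (pos, fact) in enumerate(zip(positions, facts)):
        haystack.insert(pos + offset, fact)

    return {"context": " ".join(haystack),
            "question": question,
            "answer": answer}


# Hypothetical two-hop example in the spirit of the bAbI fact-chaining task:
sample = build_long_context_sample(
    facts=["Mary went to the kitchen.", "Mary picked up the apple."],
    question="Where is the apple?",
    answer="kitchen",
    distractor_sentences=[
        "The weather report predicted rain for the weekend.",
        "The committee postponed its meeting until next month.",
    ],
)
```

Because the filler text can be made arbitrarily long, the same question can be posed at any context size, which is how the benchmark scales from a few thousand tokens up to millions.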

Why it matters?

This research is important because it helps improve our understanding of how LLMs can be used in practical situations where long documents are common. By providing a structured way to evaluate these models on their ability to reason across lengthy texts, BABILong encourages further development in AI technologies that require deep comprehension and critical thinking skills. This could enhance applications in fields like education, law, and research where handling large amounts of information is essential.

Abstract

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
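For completeness, below is a minimal, hedged sketch of how one could score a chat-style LLM on samples in the format produced by the earlier sketch, counting a sample as correct when the gold answer appears in the model's short reply. The model checkpoint, prompt template, and sample layout are illustrative assumptions rather than the paper's evaluation setup; the official benchmark splits should be obtained from the authors' release.

```python
# Illustrative evaluation sketch (not the paper's code): prompt a causal LM
# with the long context and check whether the gold answer appears in its reply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # any long-context chat model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)


def answer_question(context: str, question: str) -> str:
    prompt = (
        "Read the text and answer the question with a single word.\n\n"
        f"Text: {context}\n\nQuestion: {question}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]  # keep only the reply
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


def qa_accuracy(samples) -> float:
    """samples: iterable of dicts with 'context', 'question', 'answer' keys."""
    samples = list(samples)
    hits = sum(
        s["answer"].lower() in answer_question(s["context"], s["question"]).lower()
        for s in samples
    )
    return hits / len(samples)
```

Running such a loop at increasing context sizes is one simple way to reproduce the kind of degradation curve the paper reports, where accuracy falls as the relevant facts are buried deeper in longer haystacks.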