Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Jonathan Roberts, Kai Han, Samuel Albanie

2024-11-08

Summary

This paper examines how well large language models (LLMs) can follow threads of information through very long contexts, focusing on their ability to retrieve specific details from near-million-token collections of mostly irrelevant text.

What's the problem?

As LLMs have improved, they can accept much larger amounts of text at once. However, it is still unclear how effectively these models actually use that context, especially when the relevant information is scattered across many pages or different documents. In practice, a model's accuracy may degrade well before its advertised context limit is reached, causing it to miss or misreport the details it was asked to find.

What's the solution?

The researchers ran a set of retrieval experiments on 17 leading LLMs to see how well each could track one or more threads of information through its context window, where a thread is a chain of key-value pairs in which each value points to the next key (a minimal version of this setup is sketched below). They found that many models are remarkably good at following several threads simultaneously without losing accuracy. However, they also found that for many models the effective context limit is significantly shorter than the supported context length, with performance dropping as the amount of text increases. Finally, they cautioned that token counts from different models' tokenizers should not be compared directly, since the same number of tokens can correspond to substantially different amounts of actual text.
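
To make the threading setup concrete, here is a minimal sketch of how such a task can be generated. It assumes the general recipe described above: a haystack of random key-value pairs hiding one chain whose values lead to a final answer. The pair format, counts, and prompt wording here are illustrative assumptions, not the paper's exact configuration.

```python
import json
import random
import uuid

def build_threaded_haystack(num_pairs=100, thread_length=5, seed=0):
    """Build a key-value haystack hiding one thread.

    A thread is a chain of pairs where each pair's value is the key
    of the next pair; following it from the start key yields the
    final value (the answer).
    """
    rng = random.Random(seed)

    def rand_id():
        return str(uuid.UUID(int=rng.getrandbits(128)))

    # Mostly irrelevant distractor pairs.
    pairs = {rand_id(): rand_id() for _ in range(num_pairs)}

    # Add the thread's pairs: key_0 -> key_1 -> ... -> answer.
    thread_keys = [rand_id() for _ in range(thread_length)]
    answer = rand_id()
    for i, key in enumerate(thread_keys):
        pairs[key] = thread_keys[i + 1] if i + 1 < len(thread_keys) else answer

    # Shuffle so the thread's pairs are scattered through the context.
    items = list(pairs.items())
    rng.shuffle(items)
    haystack = json.dumps(dict(items), indent=1)

    prompt = (
        f"Below is a JSON object of key-value pairs.\n{haystack}\n"
        f"Starting from key {thread_keys[0]}, repeatedly look up each value "
        f"as the next key. What is the final value?"
    )
    return prompt, answer

prompt, answer = build_threaded_haystack()
print(answer)       # ground-truth final value to score the model against
print(len(prompt))  # rough haystack size in characters
```

Increasing num_pairs stretches the haystack toward a model's context limit, and generating several independent threads in the same haystack gives the multi-thread variant the study evaluates.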

Why it matters?

This research is significant because it clarifies the strengths and limitations of LLMs on complex information retrieval tasks. By measuring how reliably these models can follow threads of information, we can improve their design and deploy them more effectively in real-world scenarios such as research and data analysis.

Abstract

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.
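
As a quick illustration of the tokenizer caveat, the sketch below tokenizes the same text with two different tokenizers and compares the counts. It uses the tiktoken library's cl100k_base and o200k_base encodings purely as convenient, publicly available examples; they stand in for whichever tokenizers any two models under comparison actually use.

```python
import tiktoken

# The same passage of text, tokenized two ways.
text = ("Needle threading: following chains of key-value pairs "
        "through near-million-token contexts. ") * 200

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(text))
    # Equal token budgets can hold different amounts of actual text,
    # so raw token counts are not comparable across tokenizers.
    print(f"{name}: {n_tokens} tokens, {len(text) / n_tokens:.2f} chars/token")
```

A tokenizer with a higher characters-per-token ratio packs more written text into the same token budget, which is why a "128K-token" context can mean substantially different amounts of text for different models.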