Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction

Amanda Dsouza, Christopher Glaze, Changho Shin, Frederic Sala

2024-07-09

Summary

This paper introduces SWiM, an evaluation framework designed to test how well large language models (LLMs) handle long pieces of text. It shows that these models struggle to use information located in the middle of a long context, and it proposes a simple inference-time fix.

What's the problem?

The main problem is that while some LLMs accept very long inputs (context windows of over 2 million tokens), their performance often drops when the information they need sits in the middle of that input. This is known as the 'lost-in-the-middle' effect: important details get overlooked, leading to less accurate answers.
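The sketch below illustrates the general idea behind a test that exposes this effect: the document containing the answer is placed at different positions among distractor documents, and accuracy is measured at each position. This is a minimal illustration, not the SWiM implementation; call_llm(), the documents, and the exact-match scoring are placeholders you would replace with your own client and evaluation logic.

```python
# Minimal sketch of a "lost-in-the-middle" probe (illustrative, not the SWiM code).
# call_llm() is a placeholder for whatever model client you use.

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a long-context model and return its answer."""
    raise NotImplementedError

def build_context(answer_doc: str, distractors: list[str], position: int) -> str:
    """Insert the document containing the answer at a given index among distractors."""
    docs = distractors[:position] + [answer_doc] + distractors[position:]
    return "\n\n".join(docs)

def position_sweep(question: str, answer: str, answer_doc: str, distractors: list[str]):
    """Measure accuracy as the answer document moves from the start to the end of the context."""
    results = {}
    for position in range(len(distractors) + 1):
        context = build_context(answer_doc, distractors, position)
        prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
        response = call_llm(prompt)
        results[position] = answer.lower() in response.lower()  # crude correctness check
    return results
```

In a lost-in-the-middle scenario, accuracy is high when the answer document sits near the start or end of the context and dips when it sits in the middle.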

What's the solution?

To address this issue, the authors developed the SWiM framework, which provides a more realistic way to evaluate how long context models perform than standard tests. They tested eight models, including well-known ones like GPT-4 and Claude 3 Opus, and found that all of them suffered from the 'lost-in-the-middle' problem. They also introduced a training-free method called medoid voting: the model answers the same question several times, each time with the documents in the context shuffled into a different order, and the final answer is the medoid, the response most similar on average to all the others. This approach improved accuracy by up to 24% on single document question-answering tasks (see the sketch below).
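The following sketch shows one way the medoid-voting idea could be implemented: sample several answers under random document orderings, embed them, and return the answer with the smallest total distance to the rest. It is a minimal illustration assuming hypothetical call_llm() and embed() helpers; it is not the paper's released code.

```python
# Minimal sketch of medoid voting as described above.
# call_llm() and embed() are placeholders for your own model and embedding clients.
import random
import numpy as np

def call_llm(prompt: str) -> str:
    """Placeholder: query the long-context model and return its answer."""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """Placeholder: return a float vector embedding of the text."""
    raise NotImplementedError

def medoid_vote(question: str, documents: list[str], num_samples: int = 5) -> str:
    # 1. Generate several answers, shuffling the document order each time.
    answers = []
    for _ in range(num_samples):
        shuffled = random.sample(documents, k=len(documents))
        context = "\n\n".join(shuffled)
        answers.append(call_llm(f"{context}\n\nQuestion: {question}\nAnswer:"))

    # 2. Embed every answer and compute pairwise cosine distances.
    vectors = np.stack([embed(a) for a in answers]).astype(float)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - vectors @ vectors.T

    # 3. Return the medoid: the answer with the smallest total distance to all others.
    return answers[int(np.argmin(distances.sum(axis=1)))]
```

Because shuffling moves the relevant document to different positions across samples, at least some samples are likely to place it where the model attends to it well, and the medoid answer tends to reflect that majority behavior.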

Why it matters?

This research matters because it improves our understanding of how well LLMs manage long texts in real-world applications. Better evaluation methods and fixes like medoid voting make these models more reliable for tasks that require processing large volumes of information, such as legal documents, research papers, or other extensive written content.

Abstract

Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents. An exciting development in this space is models boasting extended context capabilities, with some accommodating over 2 million tokens. Such long context model capabilities remain uncertain in production systems, motivating the need to benchmark their performance on real world use cases. We address this challenge by proposing SWiM, an evaluation framework that addresses the limitations of standard tests. Testing the framework on eight long context models, we find that even strong models such as GPT-4 and Claude 3 Opus degrade in performance when information is present in the middle of the context window (lost-in-the-middle effect). Next, in addition to our benchmark, we propose medoid voting, a simple, but effective training-free approach that helps alleviate this effect, by generating responses a few times, each time randomly permuting documents in the context, and selecting the medoid answer. We evaluate medoid voting on single document QA tasks, achieving up to a 24% lift in accuracy.