Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems

Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu

2024-07-03

Summary

This paper introduces a new evaluation benchmark called 'Summary of a Haystack' (SummHay) that tests how well large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems can summarize information spread across large collections of documents and cite their sources accurately.

What's the problem?

The main problem is that although LLMs and RAG systems can now process very long inputs, it is hard to evaluate how well they actually use that context. Existing evaluations, like the Needle-in-a-Haystack task, are too simple: retrieving a single planted fact does not reveal whether these systems can synthesize and summarize information scattered across many documents.

What's the solution?

To address this, the authors created the SummHay task. They synthesize 'Haystacks' of documents in which specific key insights deliberately repeat across documents. Given a query, a system must summarize the relevant insights and accurately cite the source documents. Because the expected insights and their supporting documents are known in advance, the authors built a reproducible automatic evaluation that scores summaries on two aspects: Coverage (how many expected insights the summary captures) and Citation (how accurately it cites the sources), as sketched below. They evaluated 10 LLMs and 50 corresponding RAG systems, finding that even the best models fall well short of estimated human performance.
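This summary does not give the exact scoring formulas, but a minimal sketch of how a Coverage/Citation evaluation of this kind might look is shown below. The Insight structure, the covers() judge, the F1-style citation score, and the joint combination are illustrative assumptions, not the authors' exact protocol; in the benchmark itself the matching of summary bullets to reference insights is automated rather than a simple string check.

```python
# Hypothetical sketch of SummHay-style scoring (not the authors' exact protocol).
# Assumes each reference insight comes with the set of Haystack documents that
# support it, and that a judge decides whether a summary bullet covers an insight.

from dataclasses import dataclass


@dataclass
class Insight:
    text: str               # the reference insight expected in the summary
    source_docs: set[str]   # documents in the Haystack that contain it


def covers(bullet: str, insight: Insight) -> bool:
    """Placeholder judge: in the benchmark this role is automated with a more
    robust evaluator; a trivial substring check is used here for illustration."""
    return insight.text.lower() in bullet.lower()


def score_summary(bullets: list[tuple[str, set[str]]], insights: list[Insight]) -> dict:
    """bullets: (bullet_text, cited_doc_ids) pairs extracted from a system summary."""
    coverage_hits, citation_scores = 0, []
    for insight in insights:
        matched = [(text, cites) for text, cites in bullets if covers(text, insight)]
        if not matched:
            continue
        coverage_hits += 1
        # Citation quality for this insight: precision/recall of the cited documents
        # against the known supporting documents (an assumed formulation).
        _, cites = matched[0]
        precision = len(cites & insight.source_docs) / max(len(cites), 1)
        recall = len(cites & insight.source_docs) / len(insight.source_docs)
        citation_scores.append(2 * precision * recall / (precision + recall or 1))
    coverage = coverage_hits / len(insights)
    citation = sum(citation_scores) / max(len(citation_scores), 1)
    # Joint score: one simple way to combine the two aspects (an assumption here).
    return {"coverage": coverage, "citation": citation, "joint": coverage * citation}
```

The key design point this sketch illustrates is that, because the Haystack is synthesized, the ground-truth insights and their source documents are known exactly, which is what makes a fully automatic and reproducible evaluation possible.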

Why it matters?

This research is important because it highlights the challenges that current AI systems face when dealing with long texts. By developing a more complex evaluation method, it encourages improvements in AI summarization capabilities, which can lead to better performance in real-world applications like news summarization or academic research.

Abstract

LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific insights repeat across documents. The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects - Coverage and Citation. We generate Haystacks in two domains (conversation, news), and perform a large-scale evaluation of 10 LLMs and corresponding 50 RAG systems. Our findings indicate that SummHay is an open challenge for current systems, as even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.