
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui

2024-09-23

Summary

This paper introduces FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a new evaluation dataset designed to assess how well large language models (LLMs) handle complex tasks that involve retrieving information and generating responses. It aims to give a clearer picture of how these models perform in real-world situations, especially when answering questions that require multiple reasoning steps.

What's the problem?

As LLMs become more widely used for tasks like answering questions and generating text, it is important to evaluate their performance accurately. However, existing evaluation methods often test individual skills such as factuality, retrieval, and reasoning in isolation, which does not show how well these skills work together within a single model to produce accurate answers. This makes it hard to know whether a model can truly understand and synthesize information from different sources.

What's the solution?

To tackle this issue, the researchers created the FRAMES dataset, which includes 824 multi-hop questions that require a model to pull together information from multiple documents to produce a correct answer. When they tested state-of-the-art LLMs on this dataset, the models struggled without retrieval, reaching only about 0.40 accuracy, but a multi-step retrieval pipeline that gathers evidence iteratively raised accuracy to 0.66. The dataset therefore enables a more holistic, end-to-end evaluation of LLMs' abilities in realistic retrieval-augmented scenarios; a rough sketch of such a multi-step retrieval loop is shown below.
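
The summary does not include code, so the following is only a minimal sketch of what an iterative, multi-step retrieval loop can look like. The `retrieve` and `generate` callables, the prompt format, and the ANSWER/SEARCH convention are hypothetical placeholders for illustration, not the authors' actual pipeline.

```python
def multi_step_rag(question: str, retrieve, generate, max_steps: int = 5) -> str:
    """Iteratively retrieve documents, letting the model refine its query
    at each step before committing to a final answer.

    `retrieve(query)` is assumed to return a list of document strings and
    `generate(prompt)` a single model response string (both hypothetical).
    """
    context: list[str] = []
    query = question
    for _ in range(max_steps):
        # Fetch documents for the current query and accumulate them.
        context.extend(retrieve(query))

        # Ask the model for either a refined follow-up query or a final answer.
        reply = generate(
            f"Question: {question}\n"
            "Context so far:\n" + "\n".join(context) + "\n"
            "If you can answer, reply 'ANSWER: <answer>'. "
            "Otherwise reply 'SEARCH: <next query>'."
        )
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        # Otherwise treat the reply as the next retrieval query.
        query = reply.removeprefix("SEARCH:").strip()

    # Fall back to answering with whatever context was gathered.
    return generate(
        f"Question: {question}\nContext:\n" + "\n".join(context) + "\nAnswer:"
    )
```

The key design idea this illustrates is that multi-hop questions rarely yield all the needed evidence in one retrieval pass; letting the model issue follow-up queries is what the paper's results suggest drives the accuracy gain.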

Why it matters?

This research is significant because it provides a new way to evaluate AI systems that combine information retrieval with language generation. By using the FRAMES dataset, developers can better understand the strengths and weaknesses of their models, leading to improvements in how AI handles complex questions. This could enhance applications in areas like customer service, education, and research, where accurate information retrieval is crucial.

Abstract

Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
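
As a quick check, the ">50% improvement" figure follows from the reported scores: going from 0.40 to 0.66 accuracy is a 65% relative gain. Below is a small, hypothetical sketch of how accuracy on a FRAMES-style dataset might be computed; the dataset fields and the `answer_question` callable are illustrative assumptions, not the authors' released evaluation code, and the exact-match check is a simplification of whatever correctness judgment the paper actually uses.

```python
def accuracy(dataset, answer_question) -> float:
    """Fraction of questions answered correctly.

    `dataset` is assumed to be an iterable of {"question": ..., "answer": ...}
    records and `answer_question` a callable returning the model's answer string
    (both hypothetical). Exact string match is used here purely for illustration.
    """
    correct = sum(
        1
        for example in dataset
        if answer_question(example["question"]).strip().lower()
        == example["answer"].strip().lower()
    )
    return correct / len(dataset)


# Relative improvement of the multi-step pipeline over the no-retrieval
# baseline, using the numbers reported in the abstract.
baseline, multi_step = 0.40, 0.66
relative_gain = (multi_step - baseline) / baseline  # 0.65, i.e. a 65% (>50%) gain
print(f"Relative improvement: {relative_gain:.0%}")
```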