GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud
2024-09-11
Summary
This paper introduces GroUSE, a benchmark designed to evaluate how well different models can judge the quality of answers produced by retrieval-augmented generation (RAG) systems for grounded question answering.
What's the problem?
When large language models (LLMs) are used as judges to evaluate answers generated by RAG systems, it is difficult to ensure they assess answer quality accurately. Existing evaluation frameworks often overlook important failure modes, which makes it hard to trust their judgments.
What's the solution?
To address this, the authors created GroUSE, a meta-evaluation benchmark of 144 unit tests covering 7 generator failure modes. They found that while closed models performed well on GroUSE, state-of-the-art open-source judge models failed to meet the proposed evaluation criteria despite correlating strongly with GPT-4's judgments. They also showed that fine-tuning Llama-3 on GPT-4's reasoning traces significantly improved its ability to evaluate answers.
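The core idea of such a meta-evaluation is simple: build reference cases with deliberately injected failure modes and check whether a judge model's verdict matches the expected one. The sketch below illustrates this unit-test pattern; the class and function names, the toy substring-based judge, and the example cases are all hypothetical and not part of GroUSE itself, whose actual tests are prompt-based and score LLM judges.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeUnitTest:
    """A reference case, optionally with a deliberately injected failure mode."""
    question: str
    context: str          # retrieved passages the answer must be grounded in
    answer: str
    failure_mode: str     # e.g. "none", "unfaithful", "irrelevant"
    expected_pass: bool   # verdict a well-calibrated judge should return

def run_unit_tests(judge: Callable[[str, str, str], bool],
                   tests: list[JudgeUnitTest]) -> float:
    """Return the fraction of unit tests where the judge's verdict
    matches the expected verdict."""
    hits = sum(judge(t.question, t.context, t.answer) == t.expected_pass
               for t in tests)
    return hits / len(tests)

# Toy stand-in judge (an LLM call in practice): accepts an answer
# only if it appears verbatim in the retrieved context.
def naive_judge(question: str, context: str, answer: str) -> bool:
    return answer.lower() in context.lower()

tests = [
    JudgeUnitTest("Capital of France?", "Paris is the capital of France.",
                  "Paris", failure_mode="none", expected_pass=True),
    JudgeUnitTest("Capital of France?", "Paris is the capital of France.",
                  "Lyon", failure_mode="unfaithful", expected_pass=False),
]
print(run_unit_tests(naive_judge, tests))  # → 1.0
```

A judge that ignores a failure mode, e.g. one that always answers "pass", would score only 0.5 on these two cases, which is exactly the kind of blind spot that correlation with GPT-4 alone can fail to reveal.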
Why it matters?
This research matters because it shows that correlation with GPT-4's judgments alone is an incomplete proxy for a judge model's practical performance. By supplementing correlation with unit tests that probe specific failure modes, we can build more reliable evaluators, which is crucial for developing AI systems that depend on accurate grounded question answering.
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria, despite strong correlation with GPT-4's judgement. Our findings suggest that correlation with GPT-4 is an incomplete proxy for the practical performance of judge models and should be supplemented with evaluations on unit tests for precise failure mode detection. We further show that finetuning Llama-3 on GPT-4's reasoning traces significantly boosts its evaluation capabilities, improving upon both correlation with GPT-4's evaluations and calibration on reference situations.