
Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

Zackary Rackauckas, Arthur Câmara, Jakub Zavrel

2024-06-24


Summary

This paper presents RAGElo, a framework for automatically evaluating Retrieval-Augmented Generation (RAG) question-answering systems by using Large Language Models as judges and ranking competing systems in an Elo-style tournament.

What's the problem?

Evaluating RAG systems is challenging because of 'hallucination,' where the model fabricates information that isn't accurate, and because no gold-standard benchmarks exist for internal company tasks. This makes it hard to compare RAG variants, such as RAG-Fusion (RAGF), in specific settings like product question-answering at Infineon Technologies.

What's the solution?

To address these challenges, the authors developed a comprehensive evaluation framework that uses Large Language Models (LLMs) to generate synthetic queries based on real user questions and in-domain documents. The framework then employs LLMs as judges to rate both the retrieved documents and the generated answers. Finally, it ranks different RAG variants through an automated Elo-based competition, in which systems gain or lose rating points depending on which one the judge prefers on each query. Their findings show that RAGF outperforms standard RAG in completeness but underperforms in precision: its answers cover more of the relevant information, but are less consistently accurate.
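The Elo-based ranking step works like a chess tournament: each LLM-judge verdict on a query is treated as a game between two RAG variants, and ratings are updated after every game. The sketch below illustrates the standard Elo update; the function names, the K-factor of 32, and the starting rating of 1000 are illustrative choices, not RAGElo's actual API.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that agent A beats agent B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float,
               outcome: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one 'game'.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie
    (here, a 'game' is an LLM judge preferring one answer over the other).
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome - exp_a)
    new_b = rating_b + k * ((1.0 - outcome) - (1.0 - exp_a))
    return new_a, new_b

# Hypothetical round: the judge prefers RAGF's answer on one synthetic query.
ratings = {"RAG": 1000.0, "RAGF": 1000.0}
ratings["RAGF"], ratings["RAG"] = update_elo(ratings["RAGF"], ratings["RAG"], 1.0)
```

After many such pairwise comparisons across the synthetic query set, the final ratings give a leaderboard of the RAG variants, which is how RAGF ends up ahead of RAG in Elo score.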

Why it matters?

This research is significant because it introduces a systematic way to evaluate AI systems that generate answers based on retrieved information. By improving how we assess these systems, RAGElo can help developers create better AI tools for answering questions accurately and efficiently. This is particularly important in industries where precise information is crucial, such as technology and customer service.

Abstract

Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.
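The MRR@5 metric mentioned in the abstract scores a retriever by where the first relevant document appears in the top 5 results, averaged over queries. A minimal sketch of that computation (the function name and input format are illustrative):

```python
def mrr_at_k(ranked_relevance: list[list[bool]], k: int = 5) -> float:
    """Mean Reciprocal Rank@k.

    ranked_relevance: one list per query, marking each retrieved
    document as relevant (True) or not, in rank order.
    Each query contributes 1/rank of its first relevant document
    within the top k, or 0 if none appears there.
    """
    total = 0.0
    for rels in ranked_relevance:
        for rank, is_relevant in enumerate(rels[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

# Two queries: first relevant doc at rank 1 and at rank 3.
score = mrr_at_k([[True, False], [False, False, True]])  # (1 + 1/3) / 2
```

A higher MRR@5 means relevant documents tend to surface earlier in the ranking, which is the basis for the abstract's claim that Infineon's RAGF assistant retrieved slightly more relevant documents.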