Redefining Retrieval Evaluation in the Era of LLMs

Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri

2025-10-27

Summary

This paper points out that the usual ways of measuring how good a search system is don't work well when its results are consumed by a large language model (LLM) to generate answers, as in retrieval-augmented generation (RAG) systems.

What's the problem?

Traditional search metrics assume people look at search results one by one, quickly losing interest in lower-ranked results. However, LLMs don't work like that – they consider *all* the search results at once. Also, existing metrics don't penalize search results that aren't directly relevant but actually make it *harder* for the LLM to generate a good answer. Essentially, the old ways of measuring search quality don't reflect how LLMs actually use the information they're given.
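To make the mismatch concrete, here is a minimal example using the standard DCG formula (the core of nDCG, one of the metrics the paper critiques). The two rankings contain exactly the same documents, yet DCG scores them differently because of its positional discount, while an LLM that reads all passages at once sees identical content either way:

```python
import math

def dcg(rels):
    # Standard DCG: graded relevance divided by a log2 positional discount,
    # so the same document contributes less at lower ranks.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

a = dcg([1, 0, 1])  # relevant docs at ranks 1 and 3
b = dcg([1, 1, 0])  # relevant docs at ranks 1 and 2

# b > a even though both rankings hand the LLM the same set of passages;
# a metric aligned with LLM consumption would treat them more alike.
```

This rank sensitivity is exactly the "human vs. machine position discount" misalignment the abstract describes.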

What's the solution?

The researchers created a new way to label search results, focusing on how helpful relevant passages are and how harmful distracting passages are. Then, they developed a new metric called UDCG (Utility and Distraction-aware Cumulative Gain), which accounts for both the usefulness of information and the potential for distraction, and is designed to better match how LLMs process information. They tested this new metric on five datasets with six different LLMs.
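The paper does not spell out the exact formula here, so the following is only an illustrative sketch of the idea behind a utility- and distraction-aware score: each passage gets a signed utility label (positive if it helps the answer, negative if it distracts), and the labels are aggregated with a mild, LLM-oriented positional discount. The function name, the label scale in [-1, 1], and the single `discount` parameter are all assumptions for illustration, not the authors' actual metric:

```python
def udcg_like(labels, discount=0.95):
    """Toy utility/distraction-aware score (illustrative only).

    labels: per-passage utilities in [-1, 1]; positive values reward
    helpful passages, negative values penalize distractors. A discount
    near 1.0 reflects that an LLM attends to all passages, not just
    the top-ranked ones.
    """
    return sum(u * (discount ** i) for i, u in enumerate(labels))

# A harmful distractor lowers the score relative to a merely
# irrelevant passage, which classical metrics would simply ignore.
score_with_distractor = udcg_like([1.0, -0.5, 0.5])
score_with_irrelevant = udcg_like([1.0, 0.0, 0.5])
```

The key design point mirrored here is that distracting passages subtract from the score rather than contributing zero, which is what lets the metric correlate better with end-to-end answer accuracy.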

Why it matters?

This work is important because it provides a more accurate way to evaluate the performance of search systems used in conjunction with LLMs. By better understanding how well the search component is working, developers can build more reliable and effective systems that generate higher-quality answers.

Abstract

Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.