Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz

2025-02-20

Summary

This paper describes the LLMJudge challenge, which tests how well AI language models can judge whether a piece of information is relevant to a search query. It's like seeing if a computer can do the job of a human who decides whether search results are good or not.

What's the problem?

Creating good search systems requires a lot of human work to judge whether search results are relevant. This is especially hard for new topics or languages where there aren't many experts available. Using AI to do this job could save time and resources, but we're not sure how well AI can actually do this task compared to humans.

What's the solution?

The researchers organized a competition where different teams created AI systems to judge the relevance of search results. They collected 42 different sets of AI-generated judgments and compared them to see which methods worked best. They looked at things like how the AI was instructed (the 'prompt') and which AI model was used. All of this information was then made public for other researchers to study.
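One common way to compare AI-generated relevance labels against human judgments is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is illustrative only (the summary does not name the exact metrics used in the challenge), and the label lists are hypothetical graded relevance judgments (0-3) for the same ten query-passage pairs:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently,
    # following their own observed label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical graded labels for ten query-passage pairs:
# one set from a human assessor, one from an LLM judge.
human = [0, 3, 2, 1, 0, 2, 3, 1, 0, 2]
llm   = [0, 3, 2, 1, 1, 2, 3, 0, 0, 2]
print(f"kappa = {cohens_kappa(human, llm):.3f}")  # prints "kappa = 0.730"
```

A kappa of 1.0 means perfect agreement, 0 means no better than chance; values in between indicate partial agreement, which is what one typically sees when comparing LLM judges to human assessors.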

Why it matters?

This matters because it could revolutionize how we improve search engines and other information systems. If AI can reliably judge search results, we could create better search engines much faster and for more topics and languages. This could lead to better access to information worldwide, especially for underserved languages and topics. It also helps us understand the strengths and weaknesses of AI in tasks that usually require human judgment, which is important as AI becomes more involved in our daily lives.

Abstract

Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/