Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin

2025-05-23

Summary

This paper describes a new way to make search engines and information-finding AI work better: using large language models to fix labeling mistakes in the data they learn from.

What's the problem?

The data used to train these AI systems sometimes contains errors, such as labeling a genuinely relevant answer as irrelevant (a "false negative"). These mislabeled examples confuse the AI and make it less accurate at finding and ranking information.

What's the solution?

The researchers use cascading LLM prompts: a sequence of checks in which language models re-examine the trickiest negative examples and relabel the ones that were actually correct answers. This makes the training data more reliable, so the AI can learn better.
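To make the idea concrete, here is a minimal sketch of a cascade: a cheap judge screens every hard negative, and only uncertain cases escalate to a stronger, costlier judge. The judge functions below are hypothetical stand-ins for real LLM calls, not the paper's actual prompts.

```python
# Sketch of cascading relabeling: cheap judge first, strong judge only on
# "unsure" cases. Negatives that a judge marks relevant are treated as
# false negatives and relabeled before training.
# Both judges are hypothetical heuristics standing in for LLM calls.

def cheap_judge(query: str, passage: str) -> str:
    """Stand-in for a small, fast LLM; returns 'relevant', 'not_relevant', or 'unsure'."""
    if query.lower() in passage.lower():
        return "relevant"
    if not set(query.lower().split()) & set(passage.lower().split()):
        return "not_relevant"
    return "unsure"  # escalate this case to the stronger judge

def strong_judge(query: str, passage: str) -> str:
    """Stand-in for a larger LLM consulted only on uncertain cases."""
    overlap = len(set(query.lower().split()) & set(passage.lower().split()))
    return "relevant" if overlap >= 2 else "not_relevant"

def relabel_hard_negatives(query: str, hard_negatives: list[str]):
    """Return (kept_negatives, relabeled_as_positive) after the cascade."""
    kept, relabeled = [], []
    for passage in hard_negatives:
        verdict = cheap_judge(query, passage)
        if verdict == "unsure":
            verdict = strong_judge(query, passage)
        (relabeled if verdict == "relevant" else kept).append(passage)
    return kept, relabeled
```

The cascade keeps costs down: the expensive model only sees the cases the cheap one cannot decide, which is what makes relabeling a whole training set practical.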

Why it matters?

This matters because it makes search engines and information retrieval tools more accurate and trustworthy, which is important for students, professionals, and anyone who relies on finding the right information quickly.

Abstract

Using cascading LLM prompts to identify and relabel false negatives in datasets improves retrieval and reranking models' performance.