IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval

Tingyu Song, Guo Gan, Mingsheng Shang, Yilun Zhao

2025-03-07

Summary

This paper introduces IFIR, a new benchmark for testing how well AI systems can find information in specialized fields like finance, law, healthcare, and science when given specific instructions.

What's the problem?

Current AI systems struggle to follow complex instructions when searching for information in expert fields, and it's hard to tell how good they really are at this task because there hasn't been a reliable way to test them.

What's the solution?

The researchers created IFIR, which includes 2,426 high-quality examples of information retrieval tasks across four expert domains. The instructions vary in complexity to test different levels of AI ability. They also came up with a new way to evaluate how well an AI follows instructions by using another AI system as a judge.
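The summary doesn't spell out how that LLM-based evaluation works, but the general idea, asking a judge model to grade each retrieved passage against the instruction, can be sketched roughly as follows. Everything here (the prompt wording, the 0-2 grading scale, the `gpt-4o-mini` judge model, and the helper names) is an illustrative assumption, not IFIR's actual protocol.

```python
# Hypothetical sketch of an LLM-as-judge scorer for instruction-following
# retrieval. Prompt, grading scale, and judge model are assumptions for
# illustration only; they are not taken from the IFIR paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """\
Query: {query}
Instruction: {instruction}
Retrieved passage: {passage}

Does the passage satisfy BOTH the query and the instruction?
Answer with a single digit:
2 = fully satisfies, 1 = partially satisfies, 0 = does not satisfy."""

def judge_passage(query: str, instruction: str, passage: str) -> int:
    """Ask the judge model to grade one retrieved passage (0, 1, or 2)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, instruction=instruction, passage=passage)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip()[0])

def instruction_following_score(query: str, instruction: str,
                                passages: list[str]) -> float:
    """Average judge grade over the top-k retrieved passages, scaled to 0-1."""
    grades = [judge_passage(query, instruction, p) for p in passages]
    return sum(grades) / (2 * len(grades))  # normalize 0-2 grades to 0-1
```

A scorer like this rewards a retriever only when its results respect the instruction, not just the query, which is what makes the evaluation more precise than keyword-overlap metrics.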

Why it matters?

This matters because, as AI becomes more involved in specialized fields, we need to make sure it can accurately find and use expert information. IFIR helps us understand where current AI systems fall short and guides researchers in building better information retrieval systems for complex, real-world tasks in expert domains.

Abstract

We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.