RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems
Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li
2025-10-17
Summary
This paper focuses on improving how well AI systems, specifically those using a technique called Retrieval-Augmented Generation (RAG), can answer complicated questions that require multiple steps of reasoning.
What's the problem?
Large Language Models (LLMs) are powerful, but they sometimes make things up, have outdated information, or just get facts wrong. RAG helps by letting the AI look up information before answering. More advanced RAG systems act like 'agents' that plan, search, and think through problems. However, even these agent-based systems struggle with really complex questions that need several steps to solve, and we don't fully understand *how* they're thinking (or failing to think) along the way.
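The plan–retrieve–reason loop described above can be sketched in a few lines. This is a toy illustration, not the systems evaluated in the paper: the retriever is simple word overlap, and a real agent would use an LLM to plan each follow-up query.

```python
# Toy sketch of an agentic RAG loop (hypothetical; not the paper's implementation).
# The agent alternates between retrieving evidence and re-planning its query
# until no new evidence turns up, mimicking multi-hop retrieval.

def retrieve(query, corpus):
    """Toy retriever: return passages that share at least one word with the query."""
    words = set(query.lower().split())
    return [p for p in corpus if words & set(p.lower().split())]

def agentic_rag(question, corpus, max_steps=3):
    evidence = []
    query = question
    for _ in range(max_steps):
        new = [h for h in retrieve(query, corpus) if h not in evidence]
        if not new:  # nothing more to learn: stop and answer from evidence
            break
        evidence.extend(new)
        # A real agent would have an LLM plan the next query; here we simply
        # append the newly found evidence so follow-up hops can match it.
        query = question + " " + " ".join(new)
    return evidence

corpus = [
    "Paris is the capital of France",
    "France borders Spain",
]
print(agentic_rag("What is the capital of France?", corpus))
```

Note how the second passage only becomes retrievable after the first hop adds "France" to the query, which is exactly the multi-hop behavior that makes these systems hard to evaluate end-to-end.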
What's the solution?
The researchers created a new testing tool called RAGCap-Bench. Rather than only checking whether the final answer is right or wrong, it breaks the answering process into smaller intermediate tasks and tests how well the AI performs each one. The researchers identified the skills each step requires, catalogued the common errors LLMs make at those steps, and designed questions that specifically probe those skills. They then evaluated existing AI systems with this benchmark.
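The scoring idea behind a capability-oriented benchmark can be sketched as follows. The capability names, questions, and answers here are invented for illustration; the point is that each question is tagged with the intermediate skill it probes, and accuracy is reported per capability rather than as one end-to-end score.

```python
# Sketch of capability-oriented scoring in the spirit of RAGCap-Bench
# (hypothetical data; not the benchmark's actual question set).

from collections import defaultdict

def score_by_capability(questions, answers):
    """questions: dicts with 'capability' and 'gold' keys; answers: model picks."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, a in zip(questions, answers):
        total[q["capability"]] += 1
        if a == q["gold"]:
            correct[q["capability"]] += 1
    # Per-capability accuracy exposes *which* intermediate step is failing.
    return {cap: correct[cap] / total[cap] for cap in total}

questions = [
    {"capability": "planning", "gold": "B"},
    {"capability": "planning", "gold": "A"},
    {"capability": "evidence_extraction", "gold": "C"},
]
model_answers = ["B", "C", "C"]  # one planning mistake
print(score_by_capability(questions, model_answers))
```

A single aggregate score would hide that this model's errors concentrate in planning; the per-capability breakdown is what makes the diagnosis possible.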
Why does it matter?
This work is important because it shows that improving the AI's ability to handle these intermediate steps – the 'slow thinking' part – actually leads to better overall answers. It provides a way to specifically measure and improve these crucial reasoning skills in RAG systems, making them more reliable and accurate.
Abstract
Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.