
ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu

2025-11-14


Summary

This paper introduces a new way to test how well AI programs can do 'deep research' – that is, answer complex, open-ended questions that require gathering lots of information and putting it all together. These programs, called deep research agents, are built on large language models.

What's the problem?

Currently, it's really hard to judge how good these AI research programs are. Their answers are long, there are many possible correct answers, and the information they need changes all the time. Existing tests aren't detailed enough to really pinpoint *why* an AI is succeeding or failing at research tasks.

What's the solution?

The researchers created a benchmark called ResearchRubrics, built with over 2,800 hours of human labor. It pairs realistic research prompts from many domains with more than 2,500 very specific, expert-written guidelines (rubrics) for judging the answers. They also came up with a framework for classifying how difficult a research task is along three axes: how many ideas it spans, how deeply its reasoning steps nest, and how much open-ended exploration it requires. Finally, they developed both human and model-based ways to check how closely an AI's answer follows the expert rubrics.
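To make rubric-based scoring concrete, here is a minimal Python sketch: an answer is checked against a list of expert-written criteria, and its compliance is the fraction of criteria it satisfies. The `Rubric` class, `compliance_score` function, and example criteria are illustrative assumptions, not the benchmark's actual code or schema.

```python
# Hypothetical sketch of rubric-based scoring; names and structure are
# illustrative assumptions, not the benchmark's released API.
from dataclasses import dataclass

@dataclass
class Rubric:
    criterion: str   # e.g. "Cites at least two primary sources for the main claim"
    satisfied: bool  # judged by a human expert or a model-based judge

def compliance_score(rubrics: list[Rubric]) -> float:
    """Fraction of rubric criteria the agent's answer satisfies."""
    if not rubrics:
        return 0.0
    return sum(r.satisfied for r in rubrics) / len(rubrics)

# Example: an answer that meets 2 of 3 expert-written criteria
answer_rubrics = [
    Rubric("Grounds claims in retrieved evidence", True),
    Rubric("Addresses the implicit context of the question", False),
    Rubric("Presents reasoning clearly and without contradictions", True),
]
print(f"Rubric compliance: {compliance_score(answer_rubrics):.0%}")  # Rubric compliance: 67%
```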

Why it matters?

This work is important because it provides a much more reliable way to measure and improve AI's ability to do deep research. The evaluations show that even the best current systems, such as Gemini's and OpenAI's Deep Research agents, follow less than 68% of the expert rubrics on average, mostly because they miss implicit context and reason poorly about the information they retrieve. Releasing the prompts, rubrics, and evaluation code should help developers build better, more trustworthy AI research assistants.

Abstract

Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini's DR and OpenAI's DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
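As a rough illustration of the complexity framework described in the abstract, the sketch below records a task's ratings along the three axes (conceptual breadth, logical nesting, exploration) and maps them to a coarse difficulty bucket. The field names, ordinal scale, and bucketing rule are assumptions for illustration only; the paper's released materials define the actual scheme.

```python
# Illustrative sketch of a three-axis complexity label; the scale and the
# bucketing rule below are hypothetical, not the paper's definition.
from dataclasses import dataclass

@dataclass
class TaskComplexity:
    conceptual_breadth: int  # how many distinct concepts/domains the prompt spans
    logical_nesting: int     # how deeply dependent the reasoning steps are
    exploration: int         # how open-ended the required information gathering is

def overall_bucket(c: TaskComplexity) -> str:
    """Coarse difficulty bucket derived from the three axis ratings (assumed rule)."""
    total = c.conceptual_breadth + c.logical_nesting + c.exploration
    return "hard" if total >= 7 else "medium" if total >= 4 else "easy"

print(overall_bucket(TaskComplexity(conceptual_breadth=3, logical_nesting=2, exploration=3)))  # hard
```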