
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, Chenyan Xiong

2025-05-29


Summary

This paper introduces DeepResearchGym, a free and open evaluation sandbox where researchers can test and compare deep research systems in a fair and repeatable way.

What's the problem?

The problem is that deep research systems are hard to evaluate fairly: different groups rely on different search backends, data, and testing methods, and the commercial search APIs many evaluations depend on are costly and change over time, so results end up unreliable and hard to reproduce.

What's the solution?

To solve this, the researchers created DeepResearchGym, which gives everyone access to the same search backend, data, and evaluation protocol. It provides a free, reproducible search API over public web corpora and uses large language models as judges to assess how well different systems answer research questions, making the whole process transparent and easy to repeat.
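To make the two pieces of the evaluation loop concrete, here is a minimal Python sketch of calling a search API and building an LLM-as-a-judge prompt. The endpoint URL, request fields, and judging rubric are hypothetical placeholders for illustration, not DeepResearchGym's actual interface.

```python
# Hypothetical sketch of the evaluation loop described above:
# (1) query a reproducible search API, (2) ask an LLM to judge a report.
# Endpoint, field names, and rubric are illustrative placeholders only.
import requests

SEARCH_API = "http://localhost:8000/search"  # placeholder endpoint


def search(query: str, k: int = 10) -> list[dict]:
    """Send a query to the (hypothetical) search API and return top-k hits."""
    resp = requests.post(SEARCH_API, json={"query": query, "k": k}, timeout=30)
    resp.raise_for_status()
    return resp.json()["results"]


def build_judge_prompt(question: str, report: str) -> str:
    """Format an LLM-as-a-judge prompt; the rubric below is illustrative."""
    return (
        "You are grading a research report.\n"
        f"Question: {question}\n"
        f"Report:\n{report}\n\n"
        "Rate the report's relevance, faithfulness to its sources, and "
        "coverage of key points on a 1-5 scale, then explain briefly."
    )


if __name__ == "__main__":
    hits = search("effects of sleep deprivation on memory")
    print(f"Retrieved {len(hits)} documents from the sandbox index.")
    print(build_judge_prompt("How does sleep loss affect memory?",
                             "Sleep deprivation impairs consolidation..."))
```

Because the search index is fixed and openly available, every team that runs this kind of loop retrieves the same documents for the same query, which is what makes results comparable across papers.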

Why it matters?

This is important because it helps the AI research community trust reported results, speeds up progress by making new systems easier to compare, and lowers the barrier for more people to work on deep research.

Abstract

DeepResearchGym provides an open-source evaluation framework for deep research systems using a reproducible search API and LLM-as-a-judge assessments.