
ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan

2026-02-18


Summary

This paper introduces ResearchGym, a new way to test how well AI agents can actually *do* research, not just talk about it. It works like a simulated lab: the AI has to come up with ideas, run experiments, and try to beat the results reported in real published papers, much like a working scientist.

What's the problem?

Currently, we're really good at building AI that can *seem* intelligent, like chatbots that can write convincingly. But it's hard to know if these AIs can actually perform complex tasks that require planning, experimentation, and problem-solving over a long period of time. Existing benchmarks mostly test narrow skills, not the full research process. We need a way to see if AI can truly discover new things and improve on existing knowledge.

What's the solution?

The researchers took five highly rated papers (oral and spotlight presentations) from major AI conferences and built 'task environments' from them. They kept each paper's datasets, evaluation code, and baseline implementations, but removed the paper's core new idea. Then they gave a powerful AI agent (built on GPT-5) the challenge of solving the original research problem using only those resources. They tracked how well the agent did compared to the original researchers and identified where it struggled.
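
To make the setup concrete, here is a minimal sketch of what one such task environment could look like in code. This is not the paper's actual interface; every name here (TaskEnvironment, improved_over_baseline, and so on) is hypothetical and only mirrors the description above: the baseline materials are kept, the original method is withheld, and the agent's result is compared against both the repository baseline and the original paper's score.

```python
# Hypothetical sketch only: names and structure are illustrative,
# not taken from the ResearchGym codebase. Assumes a metric where
# higher scores are better.
from dataclasses import dataclass


@dataclass
class TaskEnvironment:
    """One containerized task built from a published paper's repository."""
    paper: str                         # source paper / venue
    datasets: list[str]                # datasets preserved from the repo
    baseline_score: float              # score of the baseline code left in the repo
    original_score: float              # score of the paper's withheld method
    agent_score: float | None = None   # filled in after the agent's attempt

    def improved_over_baseline(self) -> bool:
        """Did the agent beat the baselines provided in the repository?"""
        return self.agent_score is not None and self.agent_score > self.baseline_score

    def matched_original(self) -> bool:
        """Did the agent reach or surpass the original paper's reported result?"""
        return self.agent_score is not None and self.agent_score >= self.original_score


# Example usage with made-up numbers:
env = TaskEnvironment("example ICML paper", ["benchmark-v1"],
                      baseline_score=0.62, original_score=0.71)
env.agent_score = 0.65
print(env.improved_over_baseline(), env.matched_original())  # True False
```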

Why it matters?

This work is important because it shows that even very advanced AI agents still have significant weaknesses when it comes to performing complex, long-term research. The AI often made mistakes in planning, managing resources, and evaluating its own ideas. ResearchGym provides a standardized platform for testing and improving AI's research capabilities, which is crucial for making AI a truly valuable tool for scientific discovery.

Abstract

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability–reliability gap. The agent improves over the provided baselines from the repository in just 1 of 15 evaluations (6.7%) by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
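
For readers tallying the abstract's headline numbers, here is a small, purely illustrative sketch of how run-level results might be aggregated into the two reported rates. The record format and function name are assumptions, not the paper's evaluation code.

```python
# Illustrative aggregation only; field names and structure are assumed,
# not taken from the paper's evaluation code.

def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run records into the two headline rates from the abstract."""
    improvement_rate = sum(r["beat_baseline"] for r in runs) / len(runs)
    avg_completion = sum(r["subtasks_done"] / r["subtasks_total"] for r in runs) / len(runs)
    return {
        "improvement_rate": improvement_rate,      # abstract reports 1/15 = 6.7%
        "avg_subtask_completion": avg_completion,  # abstract reports 26.5% on average
    }


# Toy example with 15 made-up runs, one of which beats its baseline:
runs = [
    {"beat_baseline": i == 0, "subtasks_done": 2, "subtasks_total": 8}
    for i in range(15)
]
print(summarize(runs))  # improvement_rate = 1/15, about 0.067
```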