
AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia

2026-02-10

Summary

This paper introduces AIRS-Bench, a suite of 20 challenging tasks designed to test how well AI agents can perform scientific research, covering areas such as language modeling, mathematics, bioinformatics, and time series forecasting.

What's the problem?

Currently, it's hard to objectively measure how good AI agents are at actually *doing* science. Existing benchmarks often hand agents baseline code to start from, which doesn't test their ability to come up with ideas, analyze results, and improve their approach on their own. What was missing is a way to evaluate the full range of skills scientific discovery requires, from start to finish, without hand-holding the AI.

What's the solution?

The researchers created AIRS-Bench, a collection of 20 tasks drawn directly from recent, state-of-the-art machine learning papers. These tasks require agents to handle the entire research process (brainstorming ideas, running experiments, and refining their methods) without being given any starting code. They then tested several frontier AI models on these tasks using both sequential and parallel scaffolds, i.e. different ways of structuring the agent's workflow, and compared the results to human state-of-the-art performance. A rough sketch of what such an evaluation might look like follows below.
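To make the setup concrete, here is a minimal sketch of what a benchmark task and its evaluation could look like. This is an illustration only: the real AIRS-Bench task format and evaluation API live in the open-sourced repository, and every name below (ResearchTask, evaluate_agent, score_fn, and so on) is a hypothetical placeholder, not the paper's actual interface.

```python
# Hypothetical sketch, not the actual AIRS-Bench interface: all names below are
# illustrative assumptions about what a "task without baseline code" could look like.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ResearchTask:
    """A research task: a problem statement, a scoring function, and reference scores."""
    name: str                          # e.g. a time series forecasting task
    description: str                   # natural-language problem statement given to the agent
    score_fn: Callable[[str], float]   # scores the solution artifact the agent submits
    human_sota: float                  # published human state-of-the-art score
    ceiling: float                     # theoretical performance ceiling for the task


def evaluate_agent(agent_run: Callable[[str], str], task: ResearchTask) -> dict:
    """Run an agent on a task with no starting code and compare it to human SOTA."""
    solution = agent_run(task.description)   # the agent ideates, experiments, and submits
    score = task.score_fn(solution)
    return {
        "task": task.name,
        "score": score,
        "beats_human_sota": score > task.human_sota,
        "gap_to_ceiling": task.ceiling - score,
    }
```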

Why it matters?

This work is important because it provides a tough, realistic test for AI agents aiming to contribute to scientific progress. The results show that agents can already beat the human state of the art on a handful of tasks (four of the twenty), but they fall short on the other sixteen and, even when they win, stay below the theoretical performance ceiling. By making AIRS-Bench publicly available, the researchers hope to encourage further development of AI systems that can truly accelerate scientific discovery.

Abstract

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
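The abstract's baselines pair frontier models with sequential and parallel scaffolds. As a rough conceptual sketch only (the paper's actual scaffolds are not described here, so both functions and their signatures below are assumptions), sequential scaffolding refines one solution over several rounds, while parallel scaffolding launches independent attempts and keeps the best:

```python
# Conceptual sketch of "sequential" vs "parallel" scaffolds; these are assumed
# interpretations for illustration, not AIRS-Bench's actual implementations.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def sequential_scaffold(propose: Callable[[str], str],
                        score: Callable[[str], float],
                        task: str, rounds: int = 3) -> float:
    """One agent iteratively refines a single solution, keeping the best score seen."""
    best = float("-inf")
    prompt = task
    for _ in range(rounds):
        candidate = propose(prompt)            # agent drafts or revises a solution
        s = score(candidate)
        best = max(best, s)
        prompt = f"{task}\nYour last attempt scored {s:.3f}; try to improve it."
    return best


def parallel_scaffold(propose: Callable[[str], str],
                      score: Callable[[str], float],
                      task: str, workers: int = 4) -> float:
    """Several independent agent runs explore the task in parallel; the best result wins."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        candidates: List[str] = list(pool.map(propose, [task] * workers))
    return max(score(c) for c in candidates)
```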