MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu
2025-02-21

Summary
This paper introduces MLGym and MLGym-Bench, new tools created by Meta for developing and testing AI systems that can carry out research tasks. It's like a special gym where an AI can practice running science experiments and learn how to become a better researcher.
What's the problem?
Right now, we don't have good ways to train AI to do complex research tasks in fields like computer vision or natural language processing. It's also hard to test whether an AI can really think like a scientist, come up with new ideas, or improve existing methods. Current AI models can handle some basic tasks, but they struggle with the creative and analytical thinking that real research requires.
What's the solution?
The researchers created MLGym, a virtual laboratory where an AI agent can practice doing research tasks. They also built MLGym-Bench, a set of 13 research challenges for AI to try. These challenges cover many areas of AI research and require skills like coming up with new ideas, processing data, and analyzing results. They then tested some of the most advanced AI models on these challenges to see how well they could do; the sketch below gives a rough picture of how such a practice environment works.
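To make the "gym" idea concrete, here is a minimal sketch of a Gym-style loop for a research task, assuming the usual reset/step interface. The class name, task description, and action strings are hypothetical illustrations, not MLGym's actual API; the paper and its open-source code define the real interface.

```python
# Illustrative sketch of a Gym-style loop for an ML research task.
# ResearchTaskEnv, the task contents, and the action strings are
# hypothetical examples, NOT MLGym's actual API.

class ResearchTaskEnv:
    """Toy stand-in for a research-task environment with a Gym-like interface."""

    def reset(self):
        # Return the initial observation: the task description and a baseline score.
        return {"task": "improve the provided image classifier", "baseline_score": 0.72}

    def step(self, action):
        # Execute the agent's command (e.g., edit code, launch training),
        # then return (observation, reward, done, info).
        observation = {"stdout": f"ran: {action}"}
        reward = 0.0                    # e.g., improvement over the baseline metric
        done = action == "submit"       # episode ends when the agent submits
        return observation, reward, done, {}


env = ResearchTaskEnv()
obs = env.reset()
for command in ["edit train.py", "python train.py", "submit"]:
    obs, reward, done, info = env.step(command)
    if done:
        break
```

Framing research tasks as this kind of loop is what lets standard reinforcement learning tooling, which expects exactly this interface, be used to train and evaluate research agents.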
Why does it matter?
This matters because it could help us create AI that can actually do scientific research on its own or help human scientists more effectively. If we can train AI to think more like researchers, it could speed up scientific discoveries and innovations in many fields. It also gives us a way to measure how close we are to having AI that can truly contribute to scientific research, which is a big step towards more advanced and helpful AI systems.
Abstract
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse and open-ended AI research tasks from domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs), such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro, on our benchmark. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, and develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.
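As an illustration of what "makes it easy to add new tasks" could look like in practice, here is a purely hypothetical sketch of a config-driven task registry. The TaskConfig fields, the register_task helper, and the file paths are assumptions made for illustration, not MLGym's actual schema; the open-source repository defines the real mechanism.

```python
# Hypothetical sketch of a config-driven task registry for a framework
# of this kind. Field names and register_task are illustrative only,
# NOT MLGym's actual schema.

from dataclasses import dataclass


@dataclass
class TaskConfig:
    name: str                 # task identifier shown to the agent
    description: str          # natural-language task prompt
    starter_code: str         # path to the baseline implementation
    evaluation_script: str    # script that scores a submission
    baseline_score: float     # score the agent tries to beat


TASK_REGISTRY: dict[str, TaskConfig] = {}


def register_task(cfg: TaskConfig) -> None:
    """Add a task so agents can later be evaluated on it."""
    TASK_REGISTRY[cfg.name] = cfg


register_task(TaskConfig(
    name="cifar10-classification",
    description="Improve test accuracy of the provided image classifier.",
    starter_code="tasks/cifar10/baseline.py",
    evaluation_script="tasks/cifar10/evaluate.py",
    baseline_score=0.72,
))
```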