FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu

2025-10-17

Summary

This paper introduces a new way to test computer programs designed to act like machine learning researchers: programs that can come up with their own ideas and run experiments without human guidance.

What's the problem?

Currently, it's really hard to tell how good these 'research agent' programs actually are. Existing tests focus too much on the technical details of *making* things work rather than on the quality of the *research* itself. They also don't cover a wide enough range of research problems, often focusing on practical applications instead of fundamental science, and they can't easily scale to the complexity of real-world research projects.

What's the solution?

The researchers created a benchmark called FML-bench, which includes eight diverse, fundamental machine learning research problems. The benchmark is designed to reduce the coding burden on agents, focus on fundamental research questions rather than specific use cases, offer high task variety, and be extensible to real-world machine learning GitHub repositories. They also developed a unified evaluation framework with five complementary metrics to get a complete picture of an agent's abilities, and then tested several existing research agents on FML-bench.
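To make the evaluation idea concrete, here is a minimal sketch of how per-task scores could be aggregated across a multi-metric benchmark. All task counts line up with the paper's eight problems, but the metric names and the example scores below are invented placeholders for illustration; FML-bench's actual five metrics and scoring rules are defined in its repository.

```python
from statistics import mean

# Hypothetical metric names (placeholders, not FML-bench's real metrics).
METRICS = ["success_rate", "diversity", "rigor", "cost", "novelty"]

# Fabricated example scores in [0, 1] for two illustrative agent strategies.
scores = {
    "broad_explorer": dict(zip(METRICS, [0.8, 0.9, 0.6, 0.7, 0.8])),
    "deep_refiner":   dict(zip(METRICS, [0.7, 0.3, 0.8, 0.5, 0.6])),
}

def overall(agent_scores):
    """Average the per-metric scores into one summary number."""
    return mean(agent_scores.values())

for agent, s in scores.items():
    print(f"{agent}: overall={overall(s):.2f}")
```

In a real harness each metric would itself be averaged over the eight benchmark tasks before this final aggregation; the sketch only shows the last step.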

Why it matters?

The results showed that agents exploring a wide range of ideas were more successful than those refining a single idea in depth. This suggests that for automated research, breadth of exploration matters more than incremental refinement alone. FML-bench provides a better way to evaluate and improve these research agents, which could ultimately speed up progress in machine learning and other scientific fields.

Abstract

Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.