
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry

2024-10-10


Summary

This paper introduces MLE-bench, a new benchmark built from Kaggle competitions that evaluates how well AI agents perform machine learning engineering tasks.

What's the problem?

As AI agents become more capable, there's a need to measure their skills in machine learning engineering tasks, such as training models and preparing datasets. However, there hasn't been a comprehensive way to assess these capabilities across various real-world scenarios, making it hard to compare different AI agents and understand their progress.

What's the solution?

To address this, the authors created MLE-bench, a collection of 75 Kaggle competitions that challenge AI agents to demonstrate machine learning engineering skills. They established human performance baselines using Kaggle's public leaderboards and evaluated several frontier language models on the benchmark. The best-performing setup, OpenAI's o1-preview with the AIDE scaffolding, reached at least the level of a Kaggle bronze medal in 16.9% of the competitions. They also examined how scaling the resources available to agents affects performance, as well as the impact of contamination from pre-training data; a rough sketch of the headline metric follows below.
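The headline number here is the fraction of competitions in which an agent's submission would earn at least a bronze medal on the human leaderboard. As a rough illustration only, the sketch below shows how such a rate could be computed from per-competition ranks; the class names, the fixed 40% cutoff, and the example figures are assumptions for illustration, not the benchmark's actual code (see github.com/openai/mle-bench for that).

```python
# Illustrative sketch (not the actual MLE-bench implementation): computing a
# "bronze-or-better" rate from an agent's hypothetical leaderboard placements.
from dataclasses import dataclass


@dataclass
class CompetitionResult:
    name: str
    agent_rank: int   # rank the agent's submission would have on the human leaderboard
    num_teams: int    # number of human teams on that leaderboard


def is_bronze_or_better(result: CompetitionResult) -> bool:
    """Simplified bronze cutoff: top 40% of teams.
    Kaggle's real medal rules vary with competition size, so this is a placeholder."""
    return result.agent_rank <= 0.40 * result.num_teams


def medal_rate(results: list[CompetitionResult]) -> float:
    """Fraction of competitions where the agent reaches at least bronze."""
    if not results:
        return 0.0
    return sum(is_bronze_or_better(r) for r in results) / len(results)


# Example with made-up numbers:
results = [
    CompetitionResult("competition-a", agent_rank=120, num_teams=1000),
    CompetitionResult("competition-b", agent_rank=900, num_teams=1000),
]
print(f"bronze-or-better rate: {medal_rate(results):.1%}")  # 50.0%
```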

Why it matters?

This research is significant because it provides a structured way to evaluate the abilities of AI agents in machine learning engineering. By releasing MLE-bench and its associated code, the authors aim to encourage further research and development in this area, which could lead to more effective AI systems capable of performing complex tasks autonomously in fields like healthcare, finance, and technology.

Abstract

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.