MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, Bo Dai

2025-10-09

Summary

This paper introduces a new system called MLE-Smith that automatically creates challenges for testing how well language models can do machine learning engineering tasks, like building and improving models.

What's the problem?

Currently, creating these tests takes real effort. Each task must be designed by hand, which is slow and doesn't scale well – you can't easily build a large, diverse set of tests this way. Existing benchmarks also aren't very adaptable to different real-world situations.

What's the solution?

MLE-Smith uses a system of 'agents' working together to automatically turn raw datasets into usable machine learning challenges. It first generates a task, then verifies that it is well-formed and makes sense, and finally executes it to check that it can actually be solved. This generate-verify-execute process ensures the tasks are high-quality, realistic, and varied. The authors applied it to 224 real-world datasets and generated 606 tasks.
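The generate-verify-execute loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the `Task` fields, the agent stubs (`generate_task`, `verify`, `execute`), and their checks are all placeholder assumptions standing in for the LM-driven agents MLE-Smith actually uses.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """A candidate competition-style MLE task built from a raw dataset."""
    description: str
    dataset: str
    metric: str

# Hypothetical agent stubs; in MLE-Smith these steps are LM-driven agents.
def generate_task(dataset: str) -> Task:
    """Task-design agent: turn a raw dataset into a draft challenge."""
    return Task(description=f"Predict the target column of {dataset}",
                dataset=dataset, metric="accuracy")

def verify(task: Task) -> bool:
    """Hybrid verification: structural rules plus a semantic sanity check."""
    structurally_valid = bool(task.description and task.metric)
    semantically_sound = task.dataset in task.description  # placeholder check
    return structurally_valid and semantically_sound

def execute(task: Task) -> bool:
    """Interactive execution: confirm the task is empirically solvable."""
    return True  # placeholder: would run a baseline solution and score it

def smith(datasets):
    """Generate-verify-execute loop: only tasks passing all stages survive."""
    tasks = []
    for ds in datasets:
        task = generate_task(ds)
        if verify(task) and execute(task):
            tasks.append(task)
    return tasks

print(len(smith(["titanic.csv", "housing.csv"])))  # → 2
```

The point of the structure is that each stage acts as an independent filter, so low-quality generated tasks are discarded before they ever reach an evaluation benchmark.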

Why it matters?

This is important because it allows us to more effectively test and improve language models' ability to automate machine learning. By automatically generating a large number of high-quality tasks, we can get a better understanding of how well these models perform and identify areas where they need improvement, ultimately speeding up the development of automated machine learning tools.

Abstract

While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data remains significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks, demanding extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline, to transform raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm for scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The proposed multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith can work effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.