
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Bingchen Zhao, Despoina Magka, Minqi Jiang, Xian Li, Roberta Raileanu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Kelvin Niu, Shagun Sodhani, Michael Shvartsman, Andrei Lupu, Alisia Lupidi, Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Thomas Foster, Lucia Cipolina-Kun, Abhishek Charnalia, Derek Dunfield, Alexander H. Miller, Oisin Mac Aodha, Jakob Foerster

2025-06-30


Summary

This paper introduces the Automated LLM Speedrunning Benchmark, a test of how well AI agents can reproduce known improvements to language model training, using tasks drawn from the NanoGPT speedrun competition.

What's the problem?

Although researchers have rapidly found ways to train language models faster and better, current AI systems still struggle to re-create these improvements on their own, even when the results are already known. This exposes a gap in AI's ability to understand and apply scientific progress.

What's the solution?

The paper builds a benchmark of 19 tasks, one for each record in the NanoGPT speedrun. In each task, an AI agent receives the previous record's training script, optionally paired with hints of varying detail, and must produce a new script that matches the next record's faster training time. This setup measures how well the agent can reproduce scientific results in a realistic and structured way; a sketch of how such a speedup-recovery score might be computed follows below.
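The benchmark's key idea is scoring an agent by how much of each human record's speedup it recovers. The exact metric is defined in the paper; the snippet below is only a minimal, hypothetical sketch of one plausible formulation, and the function and parameter names (`fraction_of_speedup_recovered`, `prev_time`, `next_time`, `agent_time`) are illustrative rather than the paper's actual API.

```python
def fraction_of_speedup_recovered(prev_time: float,
                                  next_time: float,
                                  agent_time: float) -> float:
    """Hypothetical sketch: share of the speedup from the previous record
    to the next record that the agent's reproduced script achieves.

    prev_time:  training time of the previous record (the agent's starting point)
    next_time:  training time of the next, faster record (the target)
    agent_time: training time of the agent's new script

    Returns 1.0 if the agent matches the target and 0.0 if it makes no
    progress over the previous record; clamped to [0, 1] for readability
    (the clamping is a choice made for this sketch).
    """
    full_speedup = prev_time - next_time    # speedup achieved by the human record
    agent_speedup = prev_time - agent_time  # speedup achieved by the agent
    if full_speedup <= 0:
        raise ValueError("next_time must be faster than prev_time")
    return max(0.0, min(1.0, agent_speedup / full_speedup))


# Example: previous record takes 7.2 min, next record 5.0 min,
# and the agent's script reaches 6.1 min -> half the speedup recovered.
print(fraction_of_speedup_recovered(7.2, 5.0, 6.1))  # 0.5
```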

Why does it matter?

This matters because reproducing scientific results is essential for trustworthy and reliable AI research. Improving AI's ability to do this reliably is a key step toward AI systems that can contribute to scientific progress on their own.

Abstract

The Automated LLM Speedrunning Benchmark evaluates AI agents' ability to reproduce scientific results using tasks from the NanoGPT speedrun, and finds that even recent reasoning LLMs struggle to re-implement known improvements.