CodeMonkeys: Scaling Test-Time Compute for Software Engineering

Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini

2025-01-28

Summary

This paper introduces CodeMonkeys, a system that uses large language models (LLMs) to solve real-world software problems. It focuses on getting more value out of extra thinking time at inference (test-time compute) so that models can tackle complex coding issues more effectively.

What's the problem?

Current AI models struggle to solve complicated software engineering problems efficiently. Giving these models more time to think (test-time compute) can help, but it can be spent in several different ways, and researchers are still working out how best to combine them. On top of that, finding the relevant parts of a large codebase is itself difficult and expensive for AI models.

What's the solution?

The researchers created CodeMonkeys, a system that lets AI models repeatedly edit code and test it: for each issue, the model drafts a code edit together with a testing script, runs the tests, and revises the edit over several turns. It generates many of these candidate solutions per problem and then uses voting with model-generated tests, plus a final selection step, to pick the best one (see the sketch below). Because many candidates share the same up-front work, CodeMonkeys can afford a simple way of finding relevant context: it lets the model read every file in the codebase. On SWE-bench Verified, a benchmark of real GitHub issues, the system solved 57.4% of the problems at a cost of about $2,300.
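
To make the voting step concrete, here is a minimal Python sketch of choosing among candidate edits by counting how many model-generated tests each one passes. It is an illustration under simplifying assumptions, not the authors' released code: the CandidateEdit class and the apply_patch_and_run helper are hypothetical stand-ins for applying a patch and running a generated test in the real repository.

```python
# Minimal sketch (not the authors' code) of test-based voting over candidate edits.
# `apply_patch_and_run` is a hypothetical stand-in for applying a candidate patch
# and executing one model-generated test against the repository.

from dataclasses import dataclass

@dataclass
class CandidateEdit:
    patch: str  # a proposed code change for the GitHub issue

def apply_patch_and_run(patch: str, test: str) -> bool:
    # Stand-in: pretend a test "passes" if its name appears in the patch text.
    return test in patch

def vote_for_best(candidates: list[CandidateEdit], generated_tests: list[str]) -> CandidateEdit:
    """Return the candidate that passes the most model-generated tests."""
    scores = {
        c.patch: sum(apply_patch_and_run(c.patch, t) for t in generated_tests)
        for c in candidates
    }
    return max(candidates, key=lambda c: scores[c.patch])

# Example usage: three candidate patches scored against two generated tests.
tests = ["test_parser", "test_cli"]
candidates = [
    CandidateEdit("fix test_parser"),
    CandidateEdit("fix test_parser and test_cli"),
    CandidateEdit("unrelated fix"),
]
print(vote_for_best(candidates, tests).patch)  # "fix test_parser and test_cli"
```

In the full system, ties among top-scoring edits are resolved by a dedicated multi-turn selection trajectory rather than by simply taking the first top scorer as this sketch does.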

Why it matters?

This research matters because it shows a new way to make AI better at solving real-world coding problems without just making the AI models bigger. By using time and resources more efficiently, CodeMonkeys can tackle complex software issues that current AI struggles with. This could lead to faster and more effective ways of fixing bugs and improving software, potentially saving developers a lot of time and effort. The fact that the researchers are sharing all their code and data means other scientists can build on this work, possibly leading to even better AI coding assistants in the future.

Abstract

Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.
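
As a rough illustration of the "serial" and "parallel" axes described above, the following self-contained Python sketch samples several independent multi-turn trajectories per issue, each of which repeatedly revises its edit based on generated-test feedback. The model call and test runner here are trivial stand-ins (assumptions for illustration only), not the released implementation.

```python
# Minimal, self-contained sketch of serial vs. parallel test-time scaling.
# The "model" and "test runner" below are trivial stand-ins; in CodeMonkeys,
# edits and testing scripts are generated by an LLM and run against the repo.

import random

def draft_or_revise_edit(issue, previous_edit, feedback):
    # Stand-in for an LLM call that drafts (or revises) a code edit.
    return f"patch-for-{issue}-v{random.randint(0, 999)}"

def run_generated_tests(edit):
    # Stand-in for running a model-generated testing script against the edit.
    passed = random.random() < 0.3
    return passed, "all tests passed" if passed else "a generated test failed"

def sample_trajectory(issue, max_iterations=8):
    """Serial scaling: one multi-turn trajectory that keeps revising its edit
    until its generated tests pass or the iteration budget runs out."""
    edit, feedback = None, None
    for _ in range(max_iterations):
        edit = draft_or_revise_edit(issue, edit, feedback)
        passed, feedback = run_generated_tests(edit)
        if passed:
            break
    return edit

def generate_candidates(issue, num_trajectories=10):
    """Parallel scaling: independent trajectories yield a pool of candidate
    edits, later narrowed down by voting and a selection trajectory."""
    return [sample_trajectory(issue) for _ in range(num_trajectories)]

print(generate_candidates("example-issue"))
```

Raising max_iterations scales the serial axis, while raising num_trajectories scales the parallel axis; the paper's observation is that up-front costs, such as reading every file to identify relevant context, are amortized across all of these parallel samples.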