CodeClash: Benchmarking Goal-Oriented Software Engineering

John Yang, Kilian Lieret, Joyce Yang, Carlos E. Jimenez, Ofir Press, Ludwig Schmidt, Diyi Yang

2025-11-05

Summary

This paper introduces a new way to test how well AI models can write code, moving beyond simple tasks like fixing bugs to more realistic scenarios where the AI has to build and improve code over time to achieve a larger goal.

What's the problem?

Current methods for evaluating coding AI focus on small, isolated problems with clear instructions. Real software development, in contrast, is driven by bigger objectives, like making a program faster or more widely used, and involves constantly refining the code to pursue them. Whether AI can handle this kind of open-ended, iterative development without explicit guidance remains an open question.

What's the solution?

The researchers created 'CodeClash,' a benchmark where AI models compete against each other in multi-round tournaments. In each round, the models first edit their codebases, and then the codebases face off in a 'code arena' that determines winners based on objectives like scoring the most points, gathering the most resources, or surviving the longest. Each model must decide on its own how to improve its code, both in absolute terms and relative to its opponents, by analyzing competition logs, reading documentation, writing notes, or creating tests. The authors ran 1,680 of these tournaments (25,200 rounds in total), covering 8 models and 6 arenas.
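The two-phase round structure described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's actual harness: the names `edit_codebase`, `run_arena`, and `run_tournament`, and the toy "strength" scoring, are all assumptions standing in for the real LM edit phase and arena evaluation.

```python
import random

def edit_codebase(codebase: dict, feedback: dict) -> dict:
    # Stand-in for the LM edit phase: instead of an agent rewriting code,
    # we nudge a numeric "strength" using last round's score as feedback.
    updated = dict(codebase)
    updated["strength"] += feedback.get("score", 0.0) * 0.1 + random.random()
    return updated

def run_arena(codebases: dict) -> dict:
    # Stand-in for the code arena: the objective here is simple score
    # maximization (the codebase with the highest "strength" wins).
    return {name: cb["strength"] for name, cb in codebases.items()}

def run_tournament(agents, rounds=15):
    codebases = {a: {"strength": 0.0} for a in agents}
    feedback = {a: {} for a in agents}
    wins = {a: 0 for a in agents}
    for _ in range(rounds):
        # Phase 1: each agent edits its own codebase.
        codebases = {a: edit_codebase(cb, feedback[a])
                     for a, cb in codebases.items()}
        # Phase 2: codebases compete head-to-head in the arena.
        scores = run_arena(codebases)
        winner = max(scores, key=scores.get)
        wins[winner] += 1
        # Agents only observe competition results (here, their own score).
        feedback = {a: {"score": s} for a, s in scores.items()}
    return wins

wins = run_tournament(["model_a", "model_b"], rounds=15)
```

The key structural point the sketch captures is the separation of phases: agents never interact directly, only through the arena's results, which is what forces them to infer on their own how to improve against opponents.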

Why it matters?

This work matters because it shows that even as AI models get better at writing code, they still struggle with the strategic reasoning and long-term planning that complex software projects demand. The CodeClash benchmark gives researchers a new tool for studying and improving AI's ability to develop code autonomously toward real-world goals. The fact that expert human programmers beat the top models in every round highlights how much room for improvement remains.

Abstract

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.