NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu
2025-12-16
Summary
This paper introduces a new way to test how well AI coding assistants can build entire software projects from scratch, not just write small pieces of code.
What's the problem?
Currently, tests for AI coding tools mostly focus on simple tasks like finishing a line of code or fixing a small bug. These tests don't show whether an AI can actually *plan* and *execute* the many steps needed to create a complete, working software library, which requires coherent thinking sustained over a long period of time. There was no good way to measure whether these AI agents could handle the complexity of a real-world software project.
What's the solution?
The researchers created a benchmark called NL2Repo Bench. This benchmark gives an AI only a description of what a software library should do and an empty folder. The AI then has to independently figure out the library's structure, manage different parts of the code, write all the necessary code, and make sure the library can be installed and used. They tested several state-of-the-art AI models with this benchmark.
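The scoring described above, where a generated repository must be installable and is graded by how many of the benchmark's tests it passes, can be sketched roughly as follows. The specific commands (`pip install -e`, scoring uninstallable repositories as zero) and the plain averaging of per-task pass rates are illustrative assumptions, not the paper's exact harness.

```python
import subprocess


def install_ok(repo_dir: str) -> bool:
    """Check that a generated repository is pip-installable.
    (Assumed check; the benchmark's real harness may differ.)"""
    result = subprocess.run(
        ["pip", "install", "--quiet", "-e", repo_dir],
        capture_output=True,
    )
    return result.returncode == 0


def pass_rate(passed: int, total: int) -> float:
    """Fraction of a task's tests that pass; by the convention
    assumed here, an uninstallable repository would score 0.0."""
    if total <= 0:
        raise ValueError("each task must have at least one test")
    return passed / total


def average_pass_rate(results: list[tuple[int, int]]) -> float:
    """Average the per-task pass rates across all benchmark tasks."""
    return sum(pass_rate(p, t) for p, t in results) / len(results)


# Example: two generated repositories, one weak and one strong.
print(average_pass_rate([(3, 10), (8, 10)]))  # 0.55
```

Under this kind of metric, an agent can earn partial credit for a repository that installs and passes some tests, which is how an "average test pass rate below 40%" can coexist with agents rarely completing an entire repository correctly.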
Why does it matter?
The results showed that even the best AI coding assistants still struggle to build complete software projects: they often stall, lose track of the overall goal, or produce code whose parts don't work together. This research highlights that truly autonomous software development will require significant advances in long-horizon planning and reasoning, and it provides a standard test for measuring progress in this area.
Abstract
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.