LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

2025-11-18

Summary

This paper introduces a new way to test how well large language models (LLMs) can act as software developers, going beyond simple tests to simulate real-world coding projects.

What's the problem?

Current methods for evaluating LLMs on coding tasks are too basic. They usually give the model one problem at a time and don't test its ability to handle a longer, more complex project that requires multiple steps, tool use, and fixing mistakes along the way. Existing tests don't accurately reflect how a coding agent would actually work in a real software development environment, especially when dealing with large amounts of code.

What's the solution?

The researchers created LoCoBench-Agent, which builds on an existing coding benchmark called LoCoBench. They turned its 8,000 single-turn problems into interactive scenarios where the LLM acts as an agent, using eight specialized tools for file operations, search, and code analysis. They then measured agent performance on nine metrics covering code comprehension and efficiency, testing it on code projects ranging from 10,000 to 1 million 'tokens' (the chunks of text a model reads and writes).
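To make the idea of an "interactive scenario" concrete, here is a minimal sketch of a multi-turn agent-evaluation loop: an agent repeatedly picks a tool, observes the result, and the harness tracks per-episode statistics. All names here (Tool, EpisodeStats, run_episode, the toy tools) are illustrative inventions, not the benchmark's actual API.

```python
# Hypothetical sketch of a tool-using agent loop, in the spirit of
# LoCoBench-Agent's interactive environments. Not the real benchmark code.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Tool:
    name: str
    fn: Callable[[str], str]

@dataclass
class EpisodeStats:
    turns: int = 0
    tool_calls: dict = field(default_factory=dict)

def run_episode(agent_policy, tools, max_turns: int = 10) -> EpisodeStats:
    """Drive a multi-turn session: each turn the policy picks a tool and an
    argument; the session ends when the policy returns None or the turn
    budget runs out. The stats feed efficiency-style metrics."""
    registry = {t.name: t for t in tools}
    stats = EpisodeStats()
    observation = "task: locate the bug in utils.py"  # toy initial prompt
    while stats.turns < max_turns:
        action: Optional[tuple] = agent_policy(observation)
        if action is None:
            break
        tool_name, arg = action
        observation = registry[tool_name].fn(arg)
        stats.tool_calls[tool_name] = stats.tool_calls.get(tool_name, 0) + 1
        stats.turns += 1
    return stats

# Toy stand-ins for the benchmark's file/search/analysis tools.
fake_repo = {"utils.py": "def add(a, b): return a - b  # bug"}
tools = [
    Tool("read_file", lambda path: fake_repo.get(path, "<missing>")),
    Tool("search", lambda q: "utils.py" if q in fake_repo["utils.py"] else ""),
]

def scripted_policy(obs):
    # A deterministic stand-in for an LLM: search, then read, then stop.
    if obs.startswith("task:"):
        return ("search", "add")
    if obs == "utils.py":
        return ("read_file", "utils.py")
    return None

stats = run_episode(scripted_policy, tools)
print(stats.turns, stats.tool_calls)  # 2 {'search': 1, 'read_file': 1}
```

In a real harness the scripted policy would be replaced by an LLM call, and the recorded turn and tool-call counts are exactly the kind of raw signal an efficiency metric could aggregate.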

Why it matters?

This new testing framework is important because it provides a more realistic and thorough way to evaluate LLMs for software development. By identifying strengths and weaknesses in these models, it helps researchers improve them and move closer to creating AI agents that can truly automate complex coding tasks at a large scale.

Abstract

As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like LoCoBench (Qiu et al., 2025) assess long-context code understanding, they focus on single-turn evaluation and cannot capture the multi-turn interactive nature, tool usage patterns, and adaptive reasoning required by real-world coding agents. We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering workflows. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also introduce an evaluation methodology with 9 metrics across comprehension and efficiency dimensions. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens, enabling precise assessment of long-context performance. Through systematic evaluation of state-of-the-art models, we reveal several key findings: (1) agents exhibit remarkable long-context robustness; (2) a comprehension-efficiency trade-off exists with negative correlation, where thorough exploration increases comprehension but reduces efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns differentiating high-performing agents. As the first long-context LLM agent benchmark for software engineering, LoCoBench-Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.