LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

2025-09-12

Summary

This paper introduces LoCoBench, a new way to test how well artificial intelligence models, specifically large language models, can understand and work with large amounts of computer code.

What's the problem?

Current methods for evaluating AI code understanding focus on small pieces of code or simple tasks. This doesn't accurately reflect real-world software development, where programmers need to understand entire projects with many files and complex relationships between them. Existing benchmarks don't push AI models to their limits when dealing with these large, complex codebases, leaving a gap in our understanding of their capabilities.

What's the solution?

The researchers created LoCoBench, a benchmark containing 8,000 coding challenges across 10 popular programming languages. Each challenge requires the AI to analyze a codebase ranging from 10,000 to 1 million tokens (the small chunks of text a model reads and processes), a 100x range in size. The benchmark tests eight key skills, such as understanding the overall structure of a project, refactoring code across multiple files, and finding security vulnerabilities. They also developed a scoring system with 17 metrics grouped into 4 dimensions, which are combined into a single LoCoBench Score (LCBS).

Why it matters?

This work is important because it shows that even the most advanced AI models still struggle with truly understanding large and complex software projects. It highlights the need for further research and development to improve AI's ability to assist with real-world software development tasks, and provides a tool for researchers to measure progress in this area.

Abstract

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.
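As a rough illustration of how several evaluation dimensions might be folded into one composite number such as the LCBS, the sketch below computes a weighted mean of per-dimension scores. The abstract does not give the actual formula, so the dimension names (`dim_a` to `dim_d`), the equal weights, the example scores, and the function name `composite_score` are all hypothetical placeholders, not the authors' method.

```python
from typing import Mapping


def composite_score(
    dimension_scores: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Weighted mean of per-dimension scores (assumed to lie in [0, 1]).

    This is an illustrative aggregation only; the real LCBS formula
    is not specified in the abstract.
    """
    if set(dimension_scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(dimension_scores[d] * weights[d] for d in weights)
    return weighted_sum / total_weight


# Hypothetical scores for four placeholder dimensions.
example_scores = {"dim_a": 0.8, "dim_b": 0.6, "dim_c": 0.4, "dim_d": 0.2}
# Equal weights are an assumption made purely for this example.
example_weights = {"dim_a": 1.0, "dim_b": 1.0, "dim_c": 1.0, "dim_d": 1.0}

example_result = composite_score(example_scores, example_weights)
print(f"composite score: {example_result:.2f}")
```

Dividing by the total weight keeps the result on the same scale as the inputs, so changing the relative weights shifts emphasis between dimensions without changing the range of the final score.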