
Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, Timofey Bryksin

2024-06-20


Summary

This paper introduces the Long Code Arena, a new set of benchmarks designed to test how well AI models can understand and generate code when they have access to large amounts of context, such as entire projects instead of just single files.

What's the problem?

As AI models for coding have improved, they have become capable of processing much larger amounts of information at once. However, most existing benchmarks only test these models on small snippets of code or single functions, which doesn't reflect real-world programming, where developers work with large codebases and need to understand the relationships between many files.

What's the solution?

To address this issue, the authors created the Long Code Arena, a suite of six benchmarks covering coding tasks that require project-wide context. These tasks include generating code that uses specific libraries, repairing failing continuous integration builds, completing code at the project level, generating commit messages, localizing bugs, and summarizing modules. Each benchmark comes with a manually checked dataset for testing, tools for evaluation, and open-source baseline solutions built on popular language models, making it easier for other researchers to adopt the benchmarks.
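Since the datasets are published on HuggingFace Hub, they can be loaded with the standard datasets Python library. The snippet below is a minimal sketch: the dataset identifier and split are assumptions used for illustration only, and the exact names are listed on the benchmark page linked at the end of this post.

from datasets import load_dataset

# Hypothetical dataset ID under the JetBrains-Research organization;
# check the Long Code Arena page on HuggingFace for the real identifiers.
data = load_dataset("JetBrains-Research/lca-commit-message-generation", split="test")

# Each record bundles the project-wide context the task needs;
# inspecting one example shows the available fields.
print(data[0].keys())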

Why it matters?

This research is important because it fills a gap in evaluating AI models for coding tasks by focusing on long-context scenarios. By testing models in more realistic situations, the Long Code Arena can help identify their strengths and weaknesses in handling complex coding challenges. This will ultimately lead to the development of better AI tools for software development, making it easier and more efficient for programmers to work on large projects.

Abstract

Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and a link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.