LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding
Jia Li, Xuyuan Guo, Lei Li, Kechi Zhang, Ge Li, Jia Li, Zhengwei Tao, Fang Liu, Chongyang Tao, Yuqi Zhu, Zhi Jin
2025-03-10
Summary
This paper introduces LONGCODEU, a new way to test how well AI models understand long pieces of computer code.
What's the problem?
Current AI models that work with code claim they can handle really long programs, but there's no good way to check whether they actually understand all that code. This makes it hard to know if these AI models are ready for real-world programming tasks.
What's the solution?
The researchers created LONGCODEU, a test with eight different tasks that check how well AI models understand long code in four important ways. They ran this test on nine popular AI models (six general-purpose and three code-specialized) to see how they performed. The test looks at things like how well the AI can spot different parts of the code, understand how those parts work together, and make sense of the documentation.
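To make the idea concrete, here is a minimal sketch of how a LONGCODEU-style evaluation harness might be organized. The four aspect names come from the paper's abstract; everything else (the model interface, the exact-match scoring, and the length buckets used to expose degradation past 32K) is an illustrative assumption, not the paper's actual implementation.

```python
# Hypothetical sketch of a benchmark harness in the style of LONGCODEU.
# Only the four aspect names are taken from the paper; the model
# interface, scoring rule, and length buckets are assumptions.

ASPECTS = (
    "code_unit_perception",
    "intra_code_unit_understanding",
    "inter_code_unit_relation_understanding",
    "documentation_understanding",
)

def bucket(length_tokens: int) -> str:
    """Group examples by context length, so scores can be compared
    across lengths (e.g. to spot a drop beyond 32K tokens)."""
    for limit in (8_000, 16_000, 32_000, 64_000, 128_000):
        if length_tokens <= limit:
            return f"<= {limit // 1000}K"
    return "> 128K"

def evaluate(model, examples):
    """Average a 0-1 score per (aspect, length bucket).

    `model` is any callable (code, question) -> answer string;
    each example is a dict with aspect, code, question, reference,
    and length_tokens fields (an assumed schema)."""
    totals, counts = {}, {}
    for ex in examples:
        assert ex["aspect"] in ASPECTS
        key = (ex["aspect"], bucket(ex["length_tokens"]))
        answer = model(ex["code"], ex["question"])
        # Exact-match scoring; a real benchmark would likely use
        # task-specific metrics instead.
        score = float(answer.strip() == ex["reference"].strip())
        totals[key] = totals.get(key, 0.0) + score
        counts[key] = counts.get(key, 0) + 1
    return {k: totals[k] / counts[k] for k in totals}
```

Bucketing by input length is the key design point here: it lets a single score table show how the same model degrades as the code grows, which is exactly the kind of comparison the paper draws.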
Why it matters?
This matters because it shows that current AI models struggle with really long code, especially when it's more than 32,000 tokens long. That is far less than the 128K to 1M context windows the AI companies claim their models can handle. By pointing out these problems, the research helps developers build better AI models for working with code, which could lead to more useful tools for programmers in the future.
Abstract
Current advanced long-context language models offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To address this obstacle, we propose LONGCODEU, a long code understanding benchmark spanning four aspects (8 tasks) that evaluates LCLMs' long code understanding ability required for practical applications: code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs on LONGCODEU (i.e., 6 general models and 3 code models). Our experimental results reveal key limitations in current LCLMs' capabilities for long code understanding. In particular, the performance of LCLMs drops dramatically when the long code length is greater than 32K, falling far short of their claimed 128K-1M context windows. Of the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.