M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation
Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, Ge Zhang, Zekun Wang, Guoan Zhang, Bangyu Xiang, Wenbo Su, Bo Zheng
2024-11-04

Summary
This paper introduces M2RC-EVAL, a new benchmark that evaluates how well code completion models work across 18 programming languages at the repository level. It also introduces M2RC-INSTRUCT, an instruction dataset aimed at improving the ability of large language models (LLMs) to understand and complete code.
What's the problem?
Most existing benchmarks for repository-level code completion cover only a handful of programming languages (fewer than five), making it hard to assess how well models generalize across languages. Additionally, current benchmarks often report only average scores, without accounting for the specific challenges posed by different completion scenarios, which limits understanding of a model's strengths and weaknesses.
What's the solution?
The authors created M2RC-EVAL, which covers 18 programming languages and provides two types of fine-grained annotations (bucket-level and semantic-level), derived from the parsed abstract syntax tree, for different completion scenarios. This allows a more thorough evaluation of how well LLMs complete code in different contexts. They also built M2RC-INSTRUCT, a multilingual instruction dataset that improves the repository-level code completion abilities of these models. Together, these resources enable researchers to better understand and enhance the performance of code LLMs.
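
To make concrete what "repository-level" completion involves, the sketch below shows how a single evaluation example might pair cross-file context from the repository with the in-file prefix and suffix around the span to be completed. The data-class fields, prompt layout, and <FILL_ME> placeholder are illustrative assumptions, not the exact format used by M2RC-EVAL.

```python
# Illustrative sketch of a repository-level completion example (assumed
# structure; M2RC-EVAL's actual data format may differ).
from dataclasses import dataclass


@dataclass
class RepoCompletionExample:
    language: str            # one of the 18 benchmark languages, e.g. "rust"
    cross_file_context: str  # relevant snippets from other files in the repo
    prefix: str              # code before the cursor in the current file
    suffix: str              # code after the cursor in the current file
    reference: str           # ground-truth completion span


def build_prompt(example: RepoCompletionExample) -> str:
    """Assemble a fill-in-the-middle style prompt (hypothetical layout)."""
    return (
        f"# Cross-file context\n{example.cross_file_context}\n"
        f"# Current file ({example.language})\n"
        f"{example.prefix}<FILL_ME>{example.suffix}"
    )
```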
Why it matters?
This research is significant because it provides a comprehensive tool for evaluating and improving code completion models, which are crucial for software development. By expanding the range of languages and scenarios tested, M2RC-EVAL helps ensure that LLMs can effectively assist programmers in real-world situations, ultimately leading to better software tools and more efficient coding practices.
Abstract
Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of languages (<5), which cannot evaluate the general code intelligence of existing code Large Language Models (LLMs) across different languages. Besides, existing benchmarks usually report overall average scores across languages, ignoring fine-grained abilities in different completion scenarios. Therefore, to facilitate research on code LLMs in multilingual scenarios, we propose a massively multilingual repository-level code completion benchmark covering 18 programming languages (called M2RC-EVAL), which provides two types of fine-grained annotations (i.e., bucket-level and semantic-level) for different completion scenarios, obtained from the parsed abstract syntax tree. Moreover, we also curate a massively multilingual instruction corpus, M2RC-INSTRUCT, to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of our M2RC-EVAL and M2RC-INSTRUCT.
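
As a rough illustration of how the two annotation types could be derived from a parsed abstract syntax tree (for example, one produced by tree-sitter), the sketch below assigns a bucket label from a node's depth in the tree and a semantic label from its node type. The number of buckets and the node-type-to-label mapping are assumptions made for illustration; the paper's exact annotation scheme may differ.

```python
# Hedged sketch: one plausible way to derive bucket-level and semantic-level
# annotations from a parsed AST. Nodes are assumed to expose `.children` and
# `.type`, as tree-sitter nodes do; buckets and labels are illustrative.

def tree_depth(node):
    """Maximum depth of the subtree rooted at `node` (a leaf has depth 0)."""
    if not node.children:
        return 0
    return 1 + max(tree_depth(child) for child in node.children)


def bucket_label(node_depth, max_depth, num_buckets=10):
    """Bucket-level annotation: map a node's depth in the AST into one of
    `num_buckets` equally sized depth ranges (assumed bucketing strategy)."""
    if max_depth == 0:
        return 0
    return min(num_buckets - 1, node_depth * num_buckets // (max_depth + 1))


# Assumed, coarse mapping from AST node types to semantic categories.
SEMANTIC_MAP = {
    "function_definition": "Function",
    "class_definition": "Class",
    "if_statement": "Statement",
    "call_expression": "Expression",
    "identifier": "Identifier",
}


def semantic_label(node):
    """Semantic-level annotation: label the completion span by the AST node
    type that covers it, folded into a small set of semantic categories."""
    return SEMANTIC_MAP.get(node.type, "Other")
```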