How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu
2024-09-09

Summary
This paper studies how to improve the training of code models through instruction tuning on carefully selected, high-quality data, which helps them perform better on coding tasks.
What's the problem?
While recent code models score well on benchmarks such as HumanEval, they perform worse on others such as LiveCodeBench. A key cause is data leakage: many instruction-tuning datasets contain problems that also appear in the benchmarks, so models effectively memorize test cases during training. Their benchmark scores therefore overstate how well they handle unseen, real-world coding tasks.
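The leakage finding can be illustrated with a simple contamination check. The Python sketch below flags training samples whose n-gram overlap with benchmark problems is high; the whitespace tokenizer, n-gram size, and 0.6 threshold are illustrative assumptions, not the authors' exact cleaning procedure.

    # Illustrative sketch only: flagging benchmark contamination via
    # n-gram overlap. Tokenization, n, and the threshold are assumed.

    def ngrams(text: str, n: int = 5) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_leaked(sample: str, benchmark: list[str],
                  n: int = 5, threshold: float = 0.6) -> bool:
        grams = ngrams(sample, n)
        if not grams:
            return False
        return any(
            len(grams & ngrams(problem, n)) / len(grams) >= threshold
            for problem in benchmark
        )

    # Toy usage: a training sample copied verbatim from a test problem
    # is flagged and removed.
    benchmark = ['def add(a, b):\n    """Return a + b."""\n    return a + b']
    train = [benchmark[0], "Explain how quicksort partitions a list."]
    clean = [s for s in train if not is_leaked(s, benchmark)]
    print(len(clean))  # -> 1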
What's the solution?
To tackle this problem, the authors propose a method for selecting high-quality training data along three dimensions: how complex the instructions are, how good the model's responses are, and how diverse the instructions are. They fine-tune LLaMA3 on the selected data to create a new family of models called XCoder. Their experiments show that XCoder achieves better results with less training data than other models.
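As a rough illustration of the first two dimensions, the following sketch ranks samples by a combined complexity-plus-quality score and keeps a fixed budget. The two scorer functions are hypothetical placeholders; the paper's actual scorers are model-based and not reproduced here.

    # Minimal sketch of score-and-filter data pruning. The scorers are
    # hypothetical stand-ins, not the paper's complexity/quality models.

    def score_complexity(instruction: str) -> float:
        # Placeholder heuristic: longer instructions count as more complex.
        return min(len(instruction.split()) / 100.0, 1.0)

    def score_quality(response: str) -> float:
        # Placeholder heuristic: favor responses that define a function.
        return 1.0 if "def " in response else 0.3

    def prune(dataset: list[dict], budget: int) -> list[dict]:
        """Rank samples by combined score and keep the top `budget`."""
        ranked = sorted(
            dataset,
            key=lambda ex: (score_complexity(ex["instruction"])
                            + score_quality(ex["response"])),
            reverse=True,
        )
        return ranked[:budget]

    # Toy usage: the harder sample with a code response ranks first.
    dataset = [
        {"instruction": "Write a function that merges two sorted lists "
                        "without using the built-in sort.",
         "response": "def merge(a, b):\n    ..."},
        {"instruction": "Print hello.",
         "response": "print('hello')"},
    ]
    print(prune(dataset, budget=1)[0]["instruction"])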
Why it matters?
This research is important because it helps improve the reliability and effectiveness of code models used in programming tasks. By ensuring that these models are trained on high-quality data, developers can create tools that are more accurate and useful for writing code, ultimately making programming easier and more efficient.
Abstract
Recently, there has been growing interest in how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models fine-tuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics according to their construction methods, providing new insights for future code LLMs. Our models and dataset are released at https://github.com/banksy23/XCoder
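The abstract's third dimension, instruction diversity, can be illustrated with a greedy filter that skips instructions too similar to ones already kept. The Jaccard word-overlap measure and the 0.5 cutoff below are assumptions standing in for whatever diversity measure the authors actually use.

    # Illustrative greedy diversity filter. Jaccard similarity over
    # word sets and the cutoff are assumed, not the paper's measure.

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def diverse_subset(instructions: list[str],
                       max_sim: float = 0.5) -> list[str]:
        """Greedily keep instructions whose similarity to every
        already-kept instruction stays below `max_sim`."""
        kept, kept_sets = [], []
        for text in instructions:
            words = set(text.lower().split())
            if all(jaccard(words, s) < max_sim for s in kept_sets):
                kept.append(text)
                kept_sets.append(words)
        return kept

    # Toy usage: the near-duplicate second instruction is dropped.
    print(diverse_subset([
        "Write a function to reverse a string.",
        "Write a function to reverse a string in place.",
        "Implement binary search over a sorted array.",
    ]))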