On Leakage of Code Generation Evaluation Datasets

Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, Matthias Gallé

2024-07-11

Summary

This paper examines contamination of code generation evaluation datasets, which can inflate how well large language models (LLMs) appear to perform. It identifies three main ways this contamination can occur and introduces a new dataset designed to help measure these effects.

What's the problem?

The main problem is that many LLMs may have been trained on data that includes code generation test sets, which biases their benchmark scores upward. This contamination can happen in several ways: (1) direct data leakage, where the model is trained on the very data later used for testing; (2) indirect leakage through synthetic data, where generated training examples end up too similar to existing test cases; and (3) overfitting to evaluation sets during model selection, where repeatedly comparing candidate models on the same benchmark tunes them to that test set rather than to the underlying task, so they fail to generalize. A sketch of a simple leakage check follows.
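To make the direct-leakage idea concrete, here is a minimal sketch of one common way to probe for overlap between a training corpus and benchmark prompts using n-gram matching. This is an illustration, not the paper's method: the n-gram size, the threshold, and the function names are all assumptions chosen for clarity.

# Sketch: flag test prompts whose n-grams heavily overlap a training corpus.
# n-gram size and threshold are illustrative assumptions, not paper values.

def ngrams(text, n=8):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_prompt, training_docs, n=8):
    """Fraction of the prompt's n-grams that appear in any training doc."""
    prompt_grams = ngrams(test_prompt, n)
    if not prompt_grams:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(prompt_grams & train_grams) / len(prompt_grams)

def flag_contaminated(test_prompts, training_docs, threshold=0.5):
    """Return prompts with suspiciously high overlap with the training data."""
    return [p for p in test_prompts
            if overlap_ratio(p, training_docs) > threshold]

Note that exact n-gram matching misses paraphrased or lightly edited copies, which is one reason indirect leakage through synthetic data is harder to detect than direct leakage.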

What's the solution?

To address this issue, the authors created a new dataset called Less Basic Python Problems (LBPP), which consists of 161 prompts with their associated Python solutions. Because the problems are newly written rather than drawn from existing sources, the dataset is designed to reduce the likelihood of contamination and to provide a more reliable way to evaluate code generation capabilities, free of the inflated scores that leaked benchmarks can produce.
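For readers who want to try the benchmark, the following is a minimal sketch of loading LBPP with the Hugging Face datasets library, using the repository linked in the abstract. The split name and the printed fields are assumptions; consult the dataset card for the actual schema.

# Sketch: load the LBPP dataset released with the paper.
from datasets import load_dataset

# The "test" split name is an assumption; check the dataset card.
lbpp = load_dataset("CohereForAI/lbpp", split="test")

# Inspect a few examples; field names depend on the published schema.
for example in lbpp.select(range(3)):
    print(example)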

Why it matters?

This research is important because it highlights the challenges of evaluating AI models accurately. By identifying and addressing contamination in training and evaluation datasets, the findings can lead to more trustworthy benchmarks for measuring model performance, ultimately improving the development of more effective AI systems.

Abstract

In this paper we consider contamination by code generation test sets, in particular their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection. Key to our findings is a new dataset of 161 prompts with their associated Python solutions, a dataset which is released at https://huggingface.co/datasets/CohereForAI/lbpp.