The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
2025-01-17

Summary
This paper introduces a new collection of computer code called 'The Heap'. It's like a giant library of clean, unique code samples in many different programming languages that researchers can use to test how well AI language models understand and work with code.
What's the problem?
As AI language models get better at working with code, they need to be trained on huge amounts of it. Training them has consumed most of the openly available code, making it hard for researchers to find 'fresh' code to test these AI models fairly; when a model is evaluated on code it already saw during training, the result is called data contamination. It's like if a teacher used every math problem in the textbook to teach the class, then had nothing left to use for the test.
What's the solution?
The researchers created 'The Heap', a massive collection of code in 57 different programming languages. They made sure it doesn't overlap with other open code datasets by removing duplicates, and they favored code released under non-permissive (copyleft) licenses, which large training datasets tend to exclude, making it less likely the code was ever used to train AI models. It's like they went out and found a whole new set of math problems that weren't in any textbook, so they could give a fair test.
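To make the deduplication idea concrete, below is a minimal sketch in Python of exact deduplication by content hashing: a file is kept only if its fingerprint has never been seen in an existing open dataset. This is an illustration of the general technique, not the paper's actual pipeline; the function names, the whitespace normalization, and the sample data are all assumptions for the sake of the example.

import hashlib

def content_fingerprint(source: str) -> str:
    """Hash file contents after light normalization, so that copies
    differing only in whitespace or line endings still match."""
    normalized = "\n".join(line.strip() for line in source.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(candidate_files, existing_fingerprints):
    """Keep only candidate files whose fingerprint does not appear in
    any previously indexed dataset (exact deduplication)."""
    kept = []
    for path, source in candidate_files:
        fp = content_fingerprint(source)
        if fp not in existing_fingerprints:
            existing_fingerprints.add(fp)  # also dedupe within the new set
            kept.append((path, source))
    return kept

# Hypothetical usage: fingerprints harvested from existing open datasets
# populate `existing_fingerprints`; a new file is admitted only if unseen.
existing_fingerprints = {content_fingerprint("print('hello')\n")}
candidates = [
    ("a.py", "print('hello')\n"),         # duplicate of known code -> dropped
    ("b.py", "print('hello, world')\n"),  # unseen -> kept
]
print([path for path, _ in deduplicate(candidates, existing_fingerprints)])
# ['b.py']

Note that exact hashing only catches (near-)identical copies; dataset curation efforts typically pair it with near-deduplication techniques such as MinHash, which also flag files that differ by small edits.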
Why does it matter?
This matters because it helps researchers evaluate AI models more accurately. Without 'The Heap', it would be like testing students on problems they've already seen, which doesn't really show whether they understand the subject. With this new dataset, researchers can better measure how well AI models truly grasp coding concepts and how they perform on genuinely unseen code. That, in turn, supports building better AI systems for coding and for understanding programming languages.
Abstract
The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.