Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
Yuxiang Wei, Hojae Han, Rajhans Samdani
2024-09-05
Summary
This paper introduces Arctic-SnowCoder, a code model that demonstrates how progressively refined, high-quality data can make pretraining for programming tasks far more efficient.
What's the problem?
Recent research shows that high-quality data is essential for training language models effectively, but what "high-quality" actually means remains poorly defined. In the coding domain, many existing models are trained on massive amounts of data without ensuring that the data is useful or relevant, leading to subpar performance on real-world coding tasks.
What's the solution?
The authors introduce Arctic-SnowCoder-1.3B, a code model pretrained on a carefully curated 555 billion tokens across three phases of progressively refined data. The first phase uses 500B tokens of standard-quality code; the second continues pretraining on 50B high-quality tokens selected by a BERT-style quality annotator trained to distinguish good code from random data; and the third trains on 5B synthetic tokens generated by Llama-3.1-70B using the phase-two high-quality data as seeds. This structured approach allows Arctic-SnowCoder to outperform other models even though it was trained on far less data overall.
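The phase-two selection step can be pictured as model-based quality filtering: a trained annotator scores every code file, and only the top-scoring fraction is kept for continued pretraining. The sketch below is illustrative only; the stand-in heuristic `quality_score` replaces the paper's actual BERT-style annotator, and the function names and keep fraction are hypothetical.

```python
# Illustrative sketch of model-based quality filtering (NOT the paper's
# actual pipeline). quality_score is a crude stand-in for the BERT-style
# quality annotator described in the paper.

def quality_score(code: str) -> float:
    """Stand-in scorer: rewards comments and docstrings as a rough
    proxy for well-documented, 'good' code."""
    lines = code.splitlines()
    if not lines:
        return 0.0
    commented = sum(1 for ln in lines if ln.strip().startswith("#"))
    has_docstring = '"""' in code
    return commented / len(lines) + (0.5 if has_docstring else 0.0)

def select_high_quality(files: list[str], keep_fraction: float = 0.1) -> list[str]:
    """Rank files by score and keep the top fraction
    (conceptually, phase one -> phase two of the pipeline)."""
    ranked = sorted(files, key=quality_score, reverse=True)
    k = max(1, int(len(files) * keep_fraction))
    return ranked[:k]
```

In the actual system the scorer is a learned classifier whose positive examples come from high-quality code files and instruction data (Magicoder and StarCoder2-Instruct), but the ranking-and-thresholding structure is the same.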
Why it matters?
This research is important because it highlights how crucial high-quality data is for training effective AI models in coding. By demonstrating that a well-curated dataset can lead to better performance, Arctic-SnowCoder sets a new standard for developing programming language models, which can help improve tools for developers and enhance coding education.
Abstract
Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
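The phase-three "enhanced pretraining" step adapts the Magicoder recipe: each high-quality snippet from phase two seeds a request to a large model (Llama-3.1-70B in the paper) to generate a fresh, self-contained document for pretraining. A minimal sketch of the seed-to-prompt step is below; the prompt wording and function name are assumptions for illustration, not the paper's actual template.

```python
# Illustrative sketch of seed-based synthetic data generation (phase three).
# The prompt template is hypothetical; the paper adapts the Magicoder
# approach with Llama-3.1-70B as the generator.

def build_seed_prompt(seed_snippet: str) -> str:
    """Wrap a phase-two code snippet into a generation prompt that asks
    the large model for a new, self-contained training document."""
    return (
        "Below is a code snippet drawn from a high-quality corpus.\n"
        "Write a new, self-contained, well-documented program inspired "
        "by the concepts it uses.\n\n"
        "Seed snippet:\n"
        f"{seed_snippet}\n"
    )
```

Each generated document is then added to the 5B-token synthetic corpus; seeding from the curated phase-two data is what keeps the synthetic distribution aligned with the high-quality one, which the authors identify as the key property of "high-quality" data.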