To Code, or Not To Code? Exploring Impact of Code in Pre-training
Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
2024-08-21
Summary
This paper explores the impact of including code in the pre-training data of large language models (LLMs) and how it affects their performance on a variety of tasks beyond coding.
What's the problem?
While many practitioners believe that using code in training helps improve LLMs, there hasn't been enough research to understand exactly how much it benefits the model's ability to perform non-coding tasks. This lack of clarity makes it hard to know whether including code is truly helpful for improving overall model performance.
What's the solution?
The authors conducted a systematic ablation study to analyze how code data in the pre-training mixture influences the performance of LLMs across different tasks, such as natural language reasoning and world knowledge. They tested models ranging from 470M to 2.8B parameters and found that adding code consistently improved performance across these areas, yielding better results than models trained on text alone.
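The core experimental knob in such ablations is the proportion of code in the pre-training data mixture. As a rough illustration (not the paper's actual pipeline; the function name, the `code_fraction` parameter, and the batch-level sampling scheme are all hypothetical), a mixture sampler might look like this:

```python
import random

def sample_mixed_batch(text_docs, code_docs, code_fraction=0.25,
                       batch_size=8, seed=0):
    """Sample a pre-training batch containing a fixed fraction of code.

    `code_fraction` is the ablation knob: 0.0 recovers a text-only
    mixture, while larger values weight code more heavily.
    """
    rng = random.Random(seed)
    n_code = round(batch_size * code_fraction)
    batch = (rng.sample(code_docs, n_code)
             + rng.sample(text_docs, batch_size - n_code))
    rng.shuffle(batch)  # interleave code and text documents
    return batch
```

In a real pre-training setup the mixing is typically done at the token or shard level rather than per batch, but the idea is the same: hold everything else fixed and vary only the code proportion.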
Why it matters?
This research is important because it highlights the value of including code in training datasets for language models. By showing that code can enhance performance across a range of tasks, it encourages developers to invest in high-quality code data, ultimately leading to more capable and versatile AI systems.
Abstract
Including code in the pre-training data mixture, even for models not specifically designed for code, has become common practice in LLM pre-training. While there is anecdotal consensus among practitioners that code data plays a vital role in general LLM performance, there is only limited work analyzing the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask: "what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find consistent results: code is a critical building block for generalization far beyond coding tasks, and improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code yields relative increases of up to 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, 6.6% in generative win-rates, and a 12x boost in code performance. Our work suggests that investing in code quality and preserving code during pre-training have positive impacts.
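The headline numbers above are relative increases over a text-only baseline, not absolute score differences. A quick sketch makes the distinction concrete (the baseline score of 40.0 below is illustrative only, not a number from the paper):

```python
def relative_increase(baseline, treatment):
    """Relative improvement of `treatment` over `baseline`, in percent."""
    return 100.0 * (treatment - baseline) / baseline

# If a text-only model scored 40.0 on an NL-reasoning benchmark, a code-
# included model scoring 43.28 would be an 8.2% relative increase --
# an absolute gain of only 3.28 points.
print(relative_increase(40.0, 43.28))
```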