
Scaling Laws for Code: Every Programming Language Matters

Jian Yang, Shawn Guo, Lin Jing, Wei Zhang, Aishan Liu, Chuan Hao, Zhoujun Li, Wayne Xin Zhao, Xianglong Liu, Weifeng Lv, Bryan Dai

2025-12-24


Summary

This research investigates how code-focused large language models scale with model size and data, and how to train them most effectively, given that real-world code spans many different programming languages.

What's the problem?

Currently, it is hard to predict how well a code-generating AI will perform, because existing scaling laws don't account for the fact that some programming languages are easier for the AI to learn than others. Prior work treats all languages the same, ignoring that real-world software projects often mix multiple languages and that some language pairs reinforce each other during training.

What's the solution?

The researchers ran over a thousand training experiments with different model sizes, amounts of data, and programming languages (such as Python, Rust, and JavaScript). They found that interpreted languages like Python benefit more from larger models and more data than compiled languages like Rust, whose performance saturates quickly. They also discovered that training on multiple languages at once improves performance, especially for languages with similar structures, and that pairing code snippets with their translations into another language ("parallel pairing") strengthens cross-lingual ability. Finally, they developed a strategy for deciding how much data each language should get during training: prioritize high-utility languages like Python, balance pairs that work well together, and allocate less to fast-saturating languages, leading to better overall results under the same compute budget.
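The per-language scaling laws described above can be sketched with a Chinchilla-style loss formula, L(N, D) = E + A/N^α + B/D^β, fitted separately for each language. The sketch below is illustrative only: the coefficient values are invented for demonstration (chosen so that the "Python" curve keeps improving with scale while the "Rust" curve flattens early) and are not the paper's fitted numbers.

```python
# Hypothetical per-language scaling coefficients for a Chinchilla-style law:
#   L(N, D) = E + A / N**alpha + B / D**beta
# where N is parameter count and D is training tokens.
# These values are made up for illustration, not taken from the paper.
COEFFS = {
    "python": {"E": 1.2, "A": 420.0, "alpha": 0.36, "B": 1100.0, "beta": 0.32},
    "rust":   {"E": 1.5, "A": 40.0,  "alpha": 0.28, "B": 80.0,   "beta": 0.24},
}

def predicted_loss(lang: str, n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for one language at a given model/data scale."""
    c = COEFFS[lang]
    return c["E"] + c["A"] / n_params ** c["alpha"] + c["B"] / n_tokens ** c["beta"]

def scaling_gain(lang: str, small: tuple, large: tuple) -> float:
    """Loss reduction when moving from a (params, tokens) budget to a larger one."""
    return predicted_loss(lang, *small) - predicted_loss(lang, *large)

if __name__ == "__main__":
    small = (2e8, 2e10)     # 0.2B parameters, 20B tokens
    large = (1.4e10, 1e12)  # 14B parameters, 1T tokens
    for lang in COEFFS:
        print(f"{lang}: loss drops by {scaling_gain(lang, small, large):.3f}")
```

With these illustrative coefficients, scaling from the small to the large budget reduces the "Python" loss by noticeably more than the "Rust" loss, mirroring the paper's finding that interpreted languages benefit more from scale than fast-saturating compiled ones.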

Why it matters?

This work is important because it provides a more accurate way to predict the performance of code-generating AIs and helps developers train them more efficiently. By understanding how different languages impact training, and how to combine them effectively, we can build better tools for software development and make AI more useful in real-world coding projects.

Abstract

Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Moreover, existing work focuses on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. It is therefore necessary to first investigate the scaling laws of individual PLs, and then account for their mutual influences to arrive at a final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1,000 experiments (equivalent to more than 336,000 H800 GPU hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for Code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the parallel-pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing the allocation to fast-saturating languages (e.g., Rust), achieving superior average performance across all PLs compared to a uniform distribution under the same compute budget.
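The parallel-pairing idea from the abstract, concatenating a code snippet with its translation into a single pre-training sample, can be sketched as follows. The sample layout here (comment markers, separators) is an assumption for illustration; the paper does not specify this exact format.

```python
# Minimal sketch of "parallel pairing": build one pre-training document by
# concatenating a snippet with its translation in another language.
# The "// language:" header format below is a hypothetical choice, not the paper's.

def make_parallel_pair(src_lang: str, src_code: str,
                       tgt_lang: str, tgt_code: str) -> str:
    """Join a snippet and its translation into a single training sample."""
    return (
        f"// language: {src_lang}\n{src_code.strip()}\n\n"
        f"// language: {tgt_lang}\n{tgt_code.strip()}\n"
    )

# Example with a syntactically similar, high-synergy pair (JavaScript-TypeScript):
js = "function add(a, b) { return a + b; }"
ts = "function add(a: number, b: number): number { return a + b; }"
sample = make_parallel_pair("javascript", js, "typescript", ts)
print(sample)
```

Placing the two versions side by side in one sample lets the model attend across the pair during pre-training, which is the mechanism the abstract credits for improved cross-lingual ability.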