Qwen2.5-Coder Technical Report

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin

2024-09-19

Summary

This paper presents the Qwen2.5-Coder series, an upgraded successor to the CodeQwen1.5 coding model, designed to generate and understand code more effectively than its predecessor.

What's the problem?

Many existing coding models struggle to generate high-quality code or understand complex programming tasks. This can limit their usefulness in real-world applications where accurate and efficient code generation is essential. Additionally, there is a need for models that can handle a wide variety of coding tasks while being easy to use for developers.

What's the solution?

The researchers developed the Qwen2.5-Coder series, which includes two models of different sizes: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. Both are built on the Qwen2.5 architecture and further pretrained on a massive corpus of over 5.5 trillion tokens that includes source code and other programming-related data. The models were then fine-tuned to perform well across multiple coding tasks, such as code generation, completion, reasoning, and repair, achieving state-of-the-art performance on more than ten benchmarks.
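
To make this concrete, here is a minimal sketch of how a developer might query one of these models for the code-generation task through the Hugging Face transformers library. The checkpoint name Qwen/Qwen2.5-Coder-7B-Instruct and the example prompt are illustrative assumptions, not details taken from the report; substitute whichever released checkpoint you actually use.

# Minimal sketch: code generation with a Qwen2.5-Coder chat checkpoint.
# The Hub ID below is an assumption based on the series naming.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask the model to write a function (the "code generation" task).
messages = [
    {"role": "user",
     "content": "Write a Python function that checks whether a string is a palindrome."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))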

Why it matters?

This research matters because it pushes the boundaries of what AI can do in terms of coding and programming assistance. By providing powerful tools for developers, the Qwen2.5-Coder series can help streamline software development processes, improve code quality, and support a wider range of programming languages and tasks. This can lead to more efficient coding practices and innovation in software development.

Abstract

In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continually pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.
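
The abstract also highlights code completion, which for code models is typically framed as fill-in-the-middle (FIM). The sketch below shows what such a prompt could look like with a base checkpoint; the <|fim_prefix|>/<|fim_suffix|>/<|fim_middle|> token names and the Qwen/Qwen2.5-Coder-7B Hub ID are assumptions to verify against the released tokenizer, not details quoted from this summary.

# Hedged sketch: fill-in-the-middle (FIM) completion with a base checkpoint.
# The special-token names and Hub ID below are assumptions; check them
# against the released tokenizer before relying on this format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B"  # assumed Hub ID for the base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The model is asked to fill in the code between the prefix and the suffix.
prefix = "def quicksort(items):\n    if len(items) <= 1:\n        return items\n"
suffix = "\n    return quicksort(smaller) + [pivot] + quicksort(larger)\n"
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Print only the generated middle section.
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))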