On Pretraining for Project-Level Code Completion

Maksim Sapronov, Evgeniy Glukhov

2025-10-17

Summary

This paper explores how to best train smaller language models to understand and generate code from entire software projects, also known as repositories.

What's the problem?

Large language models are getting good at writing code, but they often need to see a lot of code examples to learn effectively. Training these models on huge datasets from many code repositories is expensive and requires a lot of computing power. The challenge is to find ways to train smaller models to perform well on complex coding tasks, even with limited data.

What's the solution?

The researchers took an existing 1.5-billion-parameter code model called OpenCoder and continued training it on a carefully curated set of repository-level code. They extended the model's context window from 4,096 to 16,384 tokens, letting it 'see' far more code at once, and experimented with different ways of arranging repository files for training. They found that the various repository-processing strategies performed similarly, and that the single most important change was adapting the model's rotary positional embedding (RoPE) scaling parameter to the longer context. Even training on code one file at a time, at the original sequence length, remained highly effective.
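To make the RoPE adjustment concrete, here is a minimal sketch of how rotary positional embedding angles depend on the base frequency ("theta"), the kind of scaling parameter the paper identifies as the main driver of long-context gains. The specific values below (head dimension, theta bases) are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def rope_angles(positions, head_dim, theta=10_000.0):
    """Rotation angles for each (position, frequency-pair) in RoPE."""
    # One inverse frequency per pair of embedding dimensions.
    inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))
    # Shape: (num_positions, head_dim // 2)
    return np.outer(positions, inv_freq)

positions = np.arange(16_384)

# Original base: higher-frequency rotations cycle quickly, so very
# distant positions become hard to distinguish at long range.
short_ctx = rope_angles(positions, head_dim=64, theta=10_000.0)

# A larger base stretches the rotation periods, keeping distant
# positions distinguishable over the extended 16,384-token window.
long_ctx = rope_angles(positions, head_dim=64, theta=500_000.0)

# For any non-trivial frequency, the larger base rotates more slowly.
print(short_ctx[-1, 1] > long_ctx[-1, 1])  # True
```

The design intuition: increasing theta slows every rotation frequency, so the model's attention can resolve relative distances across a window several times longer than the one it was pretrained on, at the cost of a short adaptation phase of continued training.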

Why it matters?

This work shows that you don't necessarily need massive datasets and huge models to achieve good results in code generation. By focusing on efficient training techniques and adapting the model's internal settings, it's possible to build powerful code completion tools with more accessible resources, opening up research to more people and projects.

Abstract

Repository-level pretraining is commonly used to enable large language models for code to leverage codebase-wide context. This enhances their ability to generate accurate and context-aware code completions. In this work, we investigate how different repository-processing strategies affect in-context learning in OpenCoder, a 1.5B-parameter model. We extend its context window from 4,096 to 16,384 tokens by training on an additional 1B tokens of curated repository-level data. Despite relying on a smaller dataset than competing models (which often use hundreds of billions of tokens), our model achieves comparable performance on the Long Code Arena benchmark. We find that various repository-processing techniques yield similarly strong results, with the primary gain coming from adapting to a new rotary positional embedding (RoPE) scaling parameter. Finally, we show that a simpler file-level training approach at the original sequence length remains highly effective, opening up repository-level code completion research to settings with more constrained data and compute resources.