Efficient Code Embeddings from Code Generation Models
Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, Han Xiao
2025-09-01
Summary
This paper introduces jina-code-embeddings, a new suite of models built to capture the meaning of code and connect it to natural language, like English. They're designed to help you find code based on what you *want* it to do, answer technical questions about code, and spot code that does the same thing even when it's written in a different programming language.
What's the problem?
Traditionally, it's been difficult for computers to understand code the way humans do. We want to ask a question in plain English and have the computer find the relevant code, or identify code that performs a similar function, but existing methods haven't been very effective at bridging the gap between human language and programming languages. Existing code embedding models have either been too large or simply haven't performed well enough.
What's the solution?
The researchers built jina-code-embeddings on an autoregressive model, the kind of neural network used for code generation. This backbone was pre-trained on a huge amount of both text *and* code, so it already had a good grasp of both. They then fine-tuned it specifically for understanding relationships between queries and code. A key technique is 'last-token pooling': the model's internal representation of the final token it processes is used as the embedding for the entire input. They focused on building smaller but highly effective models.
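To make last-token pooling concrete, here is a minimal sketch of how an embedding could be computed from a decoder-only backbone with the Hugging Face transformers library. The model name is a placeholder, and the actual jina-code-embeddings pipeline (instruction prefixes, training recipe, and so on) is not reproduced here; this only illustrates the pooling step under those assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder name for illustration; any decoder-only model pre-trained
# on text and code would follow the same pattern.
MODEL_NAME = "some-autoregressive-code-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    # Tokenize with (right-side) padding so inputs of different lengths can be batched.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)
    # Last-token pooling: take the hidden state of the final *non-padding* token
    # as the embedding for the whole input.
    last_idx = batch["attention_mask"].sum(dim=1) - 1      # position of last real token
    emb = hidden[torch.arange(hidden.size(0)), last_idx]   # (batch, dim)
    # Normalize so similarity can be computed with a simple dot product.
    return torch.nn.functional.normalize(emb, dim=-1)
```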
Why does it matter?
This work is important because it provides a more efficient and accurate way to search for, understand, and reuse code. This can save developers a lot of time and effort, and it can also help to improve the quality of software by making it easier to find and incorporate existing, well-tested code. The fact that these models are relatively small is also a big plus, as it means they can be used in more places and require less computing power.
Abstract
jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
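As a rough illustration of the code-retrieval setting described in the abstract (not the paper's actual evaluation code), and assuming the `embed` helper sketched earlier, a natural-language query can be matched against code snippets by comparing normalized embeddings:

```python
# Hypothetical example; the query and snippets are illustrative only.
query_emb = embed(["how do I reverse a list in Python?"])
code_emb = embed([
    "def reverse(xs):\n    return xs[::-1]",
    "def add(a, b):\n    return a + b",
])
scores = query_emb @ code_emb.T     # cosine similarity (embeddings are L2-normalized)
best = scores.argmax(dim=1).item()  # index of the best-matching snippet
```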