C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
Jin Qin, Zihan Liao, Ziyin Zhang, Hang Yu, Peng Di, Rui Wang
2025-12-24
Summary
This report introduces C2LLM (Contrastive Code Large Language Models), a new family of models designed to understand and represent code. They come in two sizes, 0.5 billion and 7 billion parameters, and are built on top of existing code-focused language models.
What's the problem?
Existing methods for creating code embeddings, which are numerical representations of code used for tasks like searching and understanding, often struggle to capture the full meaning of a code snippet. Many rely on just the very last token of the sequence (the 'end-of-sequence' token), which creates a bottleneck: important information from earlier in the code can be lost (see the sketch below). It is also often difficult to adjust the size of these embeddings to fit different needs.
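To make the bottleneck concrete, here is a minimal sketch of the common last-token (EOS) pooling scheme the report argues against. The function name, tensor shapes, and right-padding assumption are illustrative choices, not details taken from C2LLM:

```python
import torch

def eos_pooling(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Use the hidden state of the last real token as the sequence embedding.

    hidden_states:  (batch, seq_len, hidden_dim) token representations from the LLM
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    # Index of the last non-padding token in each sequence (assumes padding on the right).
    last_idx = attention_mask.sum(dim=1) - 1                       # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    # Every other token's representation is discarded here -- this is the bottleneck.
    return hidden_states[batch_idx, last_idx]                      # (batch, hidden_dim)
```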
What's the solution?
The researchers built C2LLM on top of a strong existing code model, Qwen-2.5-Coder. The key innovation is a module called Pooling by Multihead Attention (PMA). Instead of reading off only the final token, PMA reuses the model's attention mechanism to combine information from all tokens in the code snippet (see the sketch below). This avoids the information bottleneck and makes it possible to produce embeddings of different sizes without complex adjustments.
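The sketch below shows one plausible form of such an attention-pooling head: a learnable query attends over all token hidden states from the LLM. The class name, single-seed setup, and optional output projection are assumptions made for illustration, not the exact C2LLM implementation:

```python
import torch
import torch.nn as nn

class PMAPooling(nn.Module):
    """Minimal sketch of Pooling by Multihead Attention (PMA).

    A learnable query attends over *all* token hidden states, so the sequence
    embedding is not forced through the final EOS token alone.
    """

    def __init__(self, hidden_dim: int, num_heads: int = 8, out_dim: int | None = None):
        super().__init__()
        # One learnable "seed" query; multiple seeds could be used and concatenated.
        self.query = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Optional projection: picking out_dim freely is one way to get a flexible
        # embedding size without Matryoshka-style training (an assumption here).
        self.proj = nn.Linear(hidden_dim, out_dim) if out_dim is not None else nn.Identity()

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
        q = self.query.expand(hidden_states.size(0), -1, -1)       # (batch, 1, hidden_dim)
        pooled, _ = self.attn(
            q, hidden_states, hidden_states,
            key_padding_mask=(attention_mask == 0),                # ignore padding tokens
        )
        return self.proj(pooled.squeeze(1))                        # (batch, out_dim or hidden_dim)
```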
Why does it matter?
C2LLM achieves state-of-the-art results on the MTEB-Code benchmark, making it the best-performing model of its size class on code understanding and retrieval tasks. This is important because better code embeddings can lead to improvements in many areas, such as code search, code completion, and bug detection. The flexibility in embedding size also makes it more adaptable to different applications.
Abstract
We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to Matryoshka Representation Learning (MRL). Trained on three million publicly available samples, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
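As a rough illustration of point 3), the snippet below contrasts MRL-style truncation with simply instantiating an attention-pooling head at the desired output size, reusing the `eos_pooling` and `PMAPooling` sketches above. How C2LLM actually exposes different embedding dimensions may differ; this is only a sketch of the idea:

```python
import torch

hidden_dim = 1024
hidden_states = torch.randn(2, 128, hidden_dim)        # dummy token embeddings (batch=2, seq=128)
attention_mask = torch.ones(2, 128, dtype=torch.long)  # no padding in this toy example

# MRL-style: train one large embedding whose prefixes remain usable, then
# truncate at inference time to the dimension you need.
full_emb = eos_pooling(hidden_states, attention_mask)   # (2, 1024)
small_emb = full_emb[:, :256]                           # (2, 256) prefix

# PMA-style (as sketched above): choose the output dimension directly when the
# pooling head is created, with no prefix-structured training objective.
pooler_256 = PMAPooling(hidden_dim, out_dim=256)
pma_emb = pooler_256(hidden_states, attention_mask)     # (2, 256)
```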