LoRACode: LoRA Adapters for Code Embeddings

Saumya Chaturvedi, Aman Chadha, Laurent Bindschaedler

2025-03-10

Summary

This paper introduces LoRACode, a new method for making AI models better at understanding and searching through computer code.

What's the problem?

Current AI models that work with code aren't great at picking up on the small details and language-specific conventions of different programming languages. The models that do perform well are either inefficient to run or expensive to use.

What's the solution?

The researchers created LoRACode, which uses a technique called Low-Rank Adaptation (LoRA) to fine-tune existing AI models. Instead of updating the whole model, LoRA trains small add-on matrices, so fewer than 2% of the model's parameters need to be updated. This makes training very fast: 2 million examples in just 25 minutes on two H100 GPUs. LoRACode outperforms older methods both at finding similar code (Code2Code search) and at matching text descriptions to code (Text2Code search).
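To make the idea concrete, here is a minimal NumPy sketch of how a LoRA adapter works (illustrative only, not the paper's code): a frozen weight matrix W is augmented with a trainable low-rank update B @ A, so only a small fraction of parameters are trained. The dimensions and rank below are hypothetical.

```python
import numpy as np

d_in, d_out, r = 768, 768, 8  # hypothetical hidden size and LoRA rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen base weight (not trained)
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (init to 0)

def lora_forward(x):
    # Effective weight is W + B @ A; since B starts at 0,
    # the adapted model initially matches the base model exactly.
    return x @ (W + B @ A).T

x = rng.standard_normal((1, d_in))
assert np.allclose(lora_forward(x), x @ W.T)  # identity at initialization

# Only A and B are trained: r * (d_in + d_out) parameters
# instead of d_in * d_out for a full fine-tune.
trainable_fraction = (A.size + B.size) / W.size
print(f"trainable fraction: {trainable_fraction:.2%}")  # ~2% for these sizes
```

With rank 8 and hidden size 768, the trainable parameters come out to roughly 2% of the base weight matrix, matching the scale the paper reports for the whole model.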

Why it matters?

This matters because it makes it easier and cheaper for programmers to find the code they need. It could help developers work faster and more efficiently, especially when dealing with multiple programming languages. The technique could also be applied to other kinds of AI models, not just ones that work with code.

Abstract

Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search tasks across multiple programming languages. Separating task-wise and language-wise adaptation helps probe the sensitivity of code retrieval to syntactic and linguistic variations.
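The abstract reports gains in Mean Reciprocal Rank (MRR), the standard retrieval metric used here. As a quick reference, MRR averages the reciprocal of the rank at which the first correct result appears for each query (the ranks below are made-up examples):

```python
def mean_reciprocal_rank(ranks):
    """MRR over queries, where each entry is the 1-based rank
    of the first correct result for that query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three hypothetical queries whose correct code snippet
# appeared at ranks 1, 2, and 4 in the retrieved list:
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 1/2 + 1/4) / 3 ≈ 0.583
```

A perfect retriever (correct result always ranked first) scores 1.0, so the reported percentage improvements translate directly into correct results appearing higher in the ranked list.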