Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report

Rikiya Takehi, Benjamin Clavié, Sean Lee, Aamir Shakir

2025-10-17

Summary

This paper introduces a new family of small, efficient models called mxbai-edge-colbert-v0, available in 17 million and 32 million parameter versions, designed for information retrieval.

What's the problem?

Existing information retrieval models are often very large and require significant computing power, making it difficult to use them on devices like phones or laptops. There's a need for smaller models that can still perform well, both for quickly finding information from large datasets and for handling longer pieces of text.

What's the solution?

The researchers created mxbai-edge-colbert-v0 by running extensive experiments on retrieval and late-interaction models, then distilling what worked into compact 17M- and 32M-parameter versions. They focused on building a strong base model that could serve as a foundation for further refinement. They also conducted 'ablation studies', systematically removing or swapping individual components of the training recipe to measure which ones mattered most for performance.
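For context, ColBERT-style models like this one score a query against a document through "late interaction": every query token embedding is compared against every document token embedding, and each query token keeps only its best match (the MaxSim operator). A minimal NumPy sketch of that scoring step, with illustrative shapes and function names of my own choosing (not taken from the paper's code):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late-interaction score, sketched with NumPy.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Token-level cosine similarity matrix: (num_query_tokens, num_doc_tokens)
    sim = query_emb @ doc_emb.T
    # MaxSim: for each query token, keep its best-matching document token,
    # then sum those maxima over all query tokens.
    return float(sim.max(axis=1).sum())

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example with random embeddings (4 query tokens, 120 doc tokens, dim 64).
rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(4, 64)))
d = normalize(rng.normal(size=(120, 64)))
print(maxsim_score(q, d))
```

Because documents are stored as per-token embeddings rather than a single vector, shrinking the encoder (as this paper does) directly reduces both the compute and the storage cost of this comparison, which is why small late-interaction models matter for on-device retrieval.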

Why it matters?

These new models are important because they offer a good balance between size and performance. They outperform other similar-sized models on standard tests and are particularly good at handling longer texts efficiently, opening the door to running powerful search tools on a wider range of devices.

Abstract

In this work, we introduce mxbai-edge-colbert-v0 models, at two different parameter counts: 17M and 32M. As part of our research, we conduct numerous experiments to improve retrieval and late-interaction models, which we intend to distill into smaller models as proof-of-concepts. Our ultimate aim is to support retrieval at all scales, from large-scale retrieval which lives in the cloud to models that can run locally, on any device. mxbai-edge-colbert-v0 is a model that we hope will serve as a solid foundation backbone for all future experiments, representing the first version of a long series of small proof-of-concepts. As part of the development of mxbai-edge-colbert-v0, we conducted multiple ablation studies, of which we report the results. In terms of downstream performance, mxbai-edge-colbert-v0 is a particularly capable small model, outperforming ColBERTv2 on common short-text benchmarks (BEIR) and representing a large step forward in long-context tasks, with unprecedented efficiency.