Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz, Benjamin Van Durme
2026-02-25
Summary
This paper focuses on making it faster and more efficient to search through large collections of information, like documents containing text, images, and videos, by improving how these collections are stored and accessed.
What's the problem?
When you search through a long document – think of a video with many scenes or a document full of images – current late-interaction search methods store and compare a separate vector for every part of that document, so computing and storage costs grow in step with the document's length. This becomes a huge problem for videos and image collections, which naturally contain a lot of data: the longer the document, the slower and more expensive the search.
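The per-part cost described above can be made concrete with a minimal sketch of ColBERT-style "MaxSim" late-interaction scoring (an illustration only, not the authors' code; the function name `maxsim_score` and the array shapes are assumptions):

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Score = sum over query tokens of the max similarity to any doc token.

    Both compute and storage grow linearly with the number of document
    vectors, which is the bottleneck the paper targets.
    """
    # (num_query_tokens, num_doc_tokens) similarity matrix
    sims = query_vecs @ doc_vecs.T
    return sims.max(axis=1).sum()

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))    # 8 query token vectors
d = rng.normal(size=(500, 128))  # 500 doc token vectors (e.g. video frames)
score = maxsim_score(q, d)
```

Compressing `d` from 500 vectors down to a small fixed budget is exactly what shrinks both the index and the similarity matrix.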
What's the solution?
The researchers came up with several ways to compress how documents are represented for searching, without losing much accuracy. They tested four techniques: resizing the document's sequence of vectors, using 'memory tokens' that learn to summarize important parts, grouping similar parts together in a hierarchy, and a new method called 'attention-guided clustering' (AGC). AGC is the paper's main contribution: it uses attention to weigh each part of a document by how semantically important it is, picks the most salient parts as cluster centers, and aggregates the rest around them – essentially building a smart, fixed-size summary for fast search.
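One plausible, simplified reading of the attention-guided idea looks like this (a sketch under stated assumptions: the function name `attention_guided_compress` is hypothetical, the attention weights `attn` are assumed to come from the encoder, and the authors' actual AGC implementation lives in their linked repository):

```python
import numpy as np

def attention_guided_compress(tokens, attn, budget):
    """Compress (n, d) token vectors down to (budget, d) vectors.

    The most-attended tokens become centroids; every token is assigned
    to its nearest centroid and aggregated with attention weights.
    """
    centroid_idx = np.argsort(attn)[-budget:]        # most salient tokens
    centroids = tokens[centroid_idx]                 # (budget, d)
    # assign every token to its nearest centroid by dot-product similarity
    assign = (tokens @ centroids.T).argmax(axis=1)   # (n,)
    compressed = np.zeros_like(centroids)
    for c in range(budget):
        mask = assign == c
        w = attn[mask]
        if w.sum() > 0:
            # attention-weighted mean of the cluster's tokens
            compressed[c] = (w[:, None] * tokens[mask]).sum(0) / w.sum()
        else:
            compressed[c] = centroids[c]
    return compressed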
Why it matters?
This work is important because it makes searching through large, complex files like videos and image collections much more practical. By reducing the storage and computing costs, it opens the door to building better search engines for all kinds of multimedia content, allowing us to quickly find what we need without waiting a long time or using excessive resources.
Abstract
We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
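For contrast with the parameterized methods, the non-parametric hierarchical baseline can be sketched as agglomerative pooling of adjacent token vectors until a fixed budget is reached (a hypothetical reading of "hierarchical pooling"; the function name and merge rule are assumptions, not the paper's exact algorithm):

```python
import numpy as np

def hierarchical_pool(tokens, budget):
    """Greedily merge the most similar *adjacent* pair of token vectors
    until only `budget` vectors remain. Non-parametric: nothing is learned,
    so the only knob is the budget itself."""
    vecs = [v for v in tokens.astype(float)]
    while len(vecs) > budget:
        # dot-product similarity of each adjacent pair
        sims = [float(vecs[i] @ vecs[i + 1]) for i in range(len(vecs) - 1)]
        i = int(np.argmax(sims))
        vecs[i:i + 2] = [(vecs[i] + vecs[i + 1]) / 2.0]  # average the pair
    return np.stack(vecs)
```

Because merges happen in fixed pairwise steps, the achievable index sizes are coarser than with AGC, which can target any budget directly – one way to read the abstract's point about AGC's greater flexibility in index size.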