
Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan

2025-11-04


Summary

This paper introduces a new way to combine results from different search methods, specifically when dealing with complex data like videos. It focuses on improving how we find the *right* video when multiple search approaches give us different options.

What's the problem?

When you're searching for something, like a video, you often use multiple search engines or methods. Each one gives you a ranked list of potential matches. Combining these lists effectively is hard, especially with videos, where you have both visual content and text descriptions. Existing fusion methods look only at how highly each search ranks an item (its rank or score), ignoring the actual content of the videos themselves.
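To make the baseline concrete, here is a minimal sketch of classic score-based fusion in the style of CombSUM (the paper uses CombSUM as a baseline; this sketch uses the closely related reciprocal-rank-fusion formula, and all names and data are illustrative). Note that it operates purely on ranks and never inspects the videos themselves:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score each candidate by summing 1/(k + rank)
    across all retrievers, then sort by the fused score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / (k + rank)
    # Best fused score first; content of the candidates is never consulted.
    return sorted(scores, key=scores.get, reverse=True)

# Two retrievers agree that vid_3 is a strong match; fusion rewards consensus.
visual_search = ["vid_3", "vid_1", "vid_7"]
text_search   = ["vid_3", "vid_9", "vid_1"]
fused = rrf_fuse([visual_search, text_search])
```

Because the fused score depends only on rank positions, a candidate that both retrievers place highly rises to the top even if neither ranked it first in isolation; this is exactly the rank-only signal that ViC augments with content evidence.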

What's the solution?

The researchers developed a system called Vote-in-Context (ViC) that uses a powerful AI model called a Vision-Language Model (VLM). Instead of just looking at rankings, ViC *shows* the AI model information about each video candidate – like a grid of images from the video and its subtitles – along with which searches found it. The AI then 'reasons' about the videos and decides which one is most likely the correct answer, effectively 'voting' based on both the content and the search results. They also created a way to neatly present the video information to the AI, called S-Grid.
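The key move is serialization: packing each candidate's content evidence and its retriever votes into one list-wise prompt. The following is a hypothetical sketch of what such a serialization could look like; the field names, layout, and prompt wording are assumptions for illustration, not the paper's exact format:

```python
def serialize_candidates(query, candidates):
    """Build a list-wise prompt combining content evidence (frame grid +
    subtitles) with retriever-vote metadata for each candidate."""
    lines = [f"Query: {query}", "Candidates:"]
    for i, cand in enumerate(candidates, start=1):
        # Which retrievers proposed this candidate, and at what rank.
        votes = ", ".join(f"{r} (rank {k})" for r, k in cand["votes"].items())
        lines.append(f"[{i}] frames: {cand['grid']} | subtitles: {cand['subs']}")
        lines.append(f"    retriever votes: {votes}")
    lines.append("Answer with the index of the best-matching video.")
    return "\n".join(lines)

# Illustrative candidates: in the real system the "grid" slot would hold an
# actual S-Grid image of sampled video frames, not a text placeholder.
candidates = [
    {"grid": "<S-Grid image #1>", "subs": "a dog catches a frisbee",
     "votes": {"clip_retriever": 1, "caption_retriever": 2}},
    {"grid": "<S-Grid image #2>", "subs": "a man throws a ball",
     "votes": {"caption_retriever": 1}},
]
prompt = serialize_candidates("dog playing frisbee in a park", candidates)
```

With both the content and the vote metadata in context, the VLM can weigh retriever consensus against what the frames and subtitles actually show, rather than trusting ranks alone.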

Why it matters?

This work is important because it significantly improves video search results without needing any extra training. It's a 'zero-shot' method, meaning it works well on new types of videos and searches right away. It sets a new standard for how well computers can understand and combine information from different sources to find what you're looking for, and it’s a simple way to make existing AI models much better at searching.

Abstract

In the retrieval domain, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding the candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that rethinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles, to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC