Scalable In-context Ranking with Generative Models

Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu

2025-10-08

Summary

This paper introduces BlockRank, a method that makes In-context Ranking (ICR) much faster and more efficient. ICR uses large language models (LLMs) to directly compare a question to a set of candidate documents and identify the most relevant ones, but it becomes very slow as the number of documents grows.

What's the problem?

The main issue with ICR is that as you give the LLM more documents to compare, the amount of computation grows quadratically with the total context length, not linearly with the number of documents. This is because of how standard self-attention works: every token in the input attends to every other token. When the context contains many candidate documents, this all-pairs attention becomes the bottleneck, making ICR impractical for large candidate lists.
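As a rough illustration (the token counts below are made-up assumptions, not figures from the paper), attention cost scales with the square of the context length, so doubling the candidate list roughly quadruples the work:

```python
def attention_cost(num_docs: int, tokens_per_doc: int = 200, query_tokens: int = 32) -> int:
    """Approximate attention operations for one forward pass:
    proportional to (total context length) squared."""
    context_len = num_docs * tokens_per_doc + query_tokens
    return context_len ** 2

# Doubling the candidate list roughly quadruples the attention cost:
ratio = attention_cost(200) / attention_cost(100)
print(ratio)  # close to 4, not 2
```

This super-linear growth is exactly what BlockRank's structured sparsity is designed to avoid.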

What's the solution?

The researchers identified two key patterns in how LLMs handle documents during ICR. First, the LLM attends densely *within* each document but only sparsely *between* different documents. Second, attention from certain query tokens in the model's middle layers strongly correlates with which documents are actually relevant. BlockRank exploits these insights by treating documents as separate 'blocks' that only attend within themselves, which cuts the attention cost from quadratic to linear in the number of documents, and by adding an auxiliary training objective that teaches those query-token attention scores to point at the truly relevant blocks.
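A toy sketch of the block-sparse attention pattern described above (sizes and layout here are illustrative assumptions; the paper's actual masking and prompt structure may differ): each document block attends only within itself, while the query tokens at the end attend to the full context, so the number of allowed token pairs grows linearly with the number of documents rather than quadratically.

```python
def blockwise_mask(doc_lens, query_len):
    """Boolean mask: mask[i][j] is True if token i may attend to token j."""
    total = sum(doc_lens) + query_len
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in doc_lens:  # dense attention within each document block only
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True
        start += n
    for i in range(start, total):  # query tokens attend to everything
        for j in range(total):
            mask[i][j] = True
    return mask

mask = blockwise_mask([3, 4], query_len=2)
allowed = sum(sum(row) for row in mask)
print(allowed)  # 43 allowed pairs, versus 81 for fully dense attention
```

With fixed-length documents, the within-block cost is a constant per document, so total cost scales linearly with the candidate count.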

Why it matters?

BlockRank makes ICR much faster and able to handle far more documents: around 500 candidates (roughly 100K tokens of context) ranked within about a second. This matters because it brings the power of LLMs to information retrieval settings where speed and scalability are crucial, like searching large databases or quickly finding answers in a vast collection of articles. It matches or beats state-of-the-art methods while being significantly more efficient, about 4.7x faster for 100 MSMarco documents in context.

Abstract

In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR), which leverages contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM to identify relevant document(s). While effective, this paradigm faces a significant efficiency challenge as the candidate list grows, due to the quadratic/super-linear scaling of the attention operation with context length. To this end, this paper first identifies inherent and exploitable structures in the attention of LLMs finetuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists, around 500 documents in-context (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
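The auxiliary contrastive objective can be sketched as a softmax cross-entropy over per-document attention mass (an assumed simplification, not the paper's exact loss; `block_scores` is a hypothetical summary of query-to-block attention, and `relevant_idx` marks the truly relevant document):

```python
import math

def contrastive_loss(block_scores, relevant_idx):
    """Treat each document block's attention mass as a logit and push
    probability onto the relevant block (numerically stable softmax)."""
    m = max(block_scores)
    exps = [math.exp(s - m) for s in block_scores]
    prob_relevant = exps[relevant_idx] / sum(exps)
    return -math.log(prob_relevant)

# The loss is small when the relevant block already dominates the
# attention mass, and large when an irrelevant block dominates:
low = contrastive_loss([5.0, 1.0, 1.0], relevant_idx=0)
high = contrastive_loss([1.0, 5.0, 1.0], relevant_idx=0)
```

Training with such a signal is what lets the model read off relevance directly from its middle-layer attention scores at inference time.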