Query-focused and Memory-aware Reranker for Long Context Processing

Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie Zhou

2026-02-25

Summary

This paper introduces a new way for large language models to re-rank retrieved passages, so the most relevant results surface at the top more reliably.

What's the problem?

When you ask a large language model a question, a retrieval step first gathers a set of potentially relevant documents. The model then needs to *re-rank* those documents so the most useful ones come first. Existing rerankers either score each document in isolation, missing the relationships among candidates, or, when they do compare documents jointly, tend to require specialized training data (such as Likert-scale relevance labels) that isn't always available.

What's the solution?

The researchers developed a reranking system built on how the language model *pays attention* to different parts of the question and each document. Specifically, they train a small model to predict how relevant a document is to a question using the attention scores of selected heads in the larger language model. Because the attention is computed over the entire candidate shortlist at once, the approach scores documents listwise, giving a more holistic view, and since it produces continuous relevance scores it doesn't need the specialized training labels other methods require. It is also efficient, working well even with relatively small models.
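To make the idea concrete, here is a minimal numpy sketch of the scoring step. The shapes and the simple weighted-sum scorer are illustrative assumptions, not the paper's exact architecture: `head_scores` stands in for the attention mass each selected head places on each candidate passage, and `head_weights` for the small trained model that combines the heads into one relevance score per passage.

```python
import numpy as np

def rerank_by_attention(head_scores, head_weights):
    """Rank candidate passages by a weighted sum of per-head attention mass.

    head_scores:  (n_heads, n_passages) array -- hypothetical aggregated
                  attention each selected head places on each candidate
                  passage while the model reads the query.
    head_weights: (n_heads,) weights of a small learned scoring model.
    Returns passage indices, most relevant first.
    """
    relevance = head_weights @ head_scores  # continuous relevance scores
    return np.argsort(-relevance)

# Toy example: 3 selected heads, 4 candidate passages.
scores = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.7, 0.1, 0.1],
])
weights = np.array([0.3, 0.3, 0.4])
order = rerank_by_attention(scores, weights)
print(order.tolist())  # passage 1 gets the most attention, so it ranks first
```

Because all candidates appear in `head_scores` together, the ranking is listwise: each passage's score is produced in the context of the whole shortlist rather than in isolation.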

Why it matters?

This work is important because it provides a more accurate and efficient way to re-rank retrieved passages for large language models. It outperforms previous methods across several domains, including general knowledge (Wikipedia) and long narratives, and sets a new state of the art on a benchmark (LoCoMo) that tests dialogue understanding and memory usage. In practice, this means better search results and more helpful responses from AI systems.

Abstract

Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
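The abstract notes that the framework "naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision." A minimal sketch of what that training setup could look like, with entirely hypothetical data and a least-squares linear scorer standing in for the paper's small trained model over selected-head attention features:

```python
import numpy as np

# Hypothetical training data: for each training example, per-head attention
# features from k selected heads, plus a continuous relevance target derived
# from an ordinary retrieval dataset (no Likert-scale labels needed).
rng = np.random.default_rng(0)
n_examples, n_heads = 200, 8
X = rng.random((n_examples, n_heads))              # per-head attention features
true_w = np.linspace(1.0, 0.1, n_heads)            # "ground truth" head weights
y = X @ true_w + 0.01 * rng.standard_normal(n_examples)  # continuous relevance

# Fit a tiny linear scorer by least squares -- a stand-in for training the
# small model that maps selected-head attention scores to relevance.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, true_w, atol=0.1))
```

The point of the sketch is only that continuous targets admit ordinary regression-style training; the paper's actual model, features, and loss may differ.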