
Contextual Document Embeddings

John X. Morris, Alexander M. Rush

2024-10-04


Summary

This paper introduces Contextual Document Embeddings, a method that improves how documents are represented in machine learning by taking the surrounding documents in a collection into account, leading to better retrieval performance.

What's the problem?

Traditional methods for creating document embeddings (numerical representations of documents) treat each document in isolation. Because they ignore the relationships and similarities between documents in a collection, they lose useful contextual information. As a result, these methods may retrieve the wrong documents for a query, especially when the collection contains many documents with similar content and structure, or when it differs from the data the embedding model was trained on.
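To make the limitation concrete, here is a minimal sketch of the standard setup in Python. The embed() function is a hypothetical stand-in for any trained biencoder; the key point is that each document's vector is computed on its own and never depends on the rest of the corpus.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a biencoder: maps text to a fixed-size vector.
    # In practice this would be a trained transformer encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

corpus = [
    "The Eiffel Tower is in Paris.",
    "The Louvre is a museum in Paris.",
    "Mount Everest is the tallest mountain.",
]

# Each document is embedded in isolation: its vector does not change
# if the rest of the corpus changes.
doc_vectors = np.stack([embed(d) for d in corpus])

query_vector = embed("famous landmarks in Paris")

# Retrieval is just cosine similarity between the query and each document
# (with random placeholder vectors the winner is arbitrary; the structure
# of the pipeline is what matters here).
scores = doc_vectors @ query_vector
print(corpus[int(np.argmax(scores))])
```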

What's the solution?

To address this issue, the authors propose two complementary improvements. First, they introduce a new training procedure for contrastive learning that builds each batch from groups of similar documents, so the model has to learn to distinguish between closely related documents rather than obviously different ones. Second, they develop a new architecture that lets the embedding model take information from neighboring documents into account while generating an embedding. This way, the model produces representations that reflect each document's context within the larger collection.
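One way to picture the first idea, as a rough sketch rather than the authors' exact pipeline: cluster the corpus with an initial embedder, then fill each training batch from a single cluster, so the in-batch negatives are topically close and harder to tell apart. The build_contextual_batches helper below is an illustrative name, not part of the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_contextual_batches(doc_vectors: np.ndarray, batch_size: int, seed: int = 0):
    # Group documents into batches of near-neighbors. Documents in the same
    # cluster land in the same batch, so in-batch negatives are similar and
    # the contrastive loss has to separate genuinely related documents.
    n_docs = len(doc_vectors)
    n_clusters = max(1, n_docs // batch_size)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(doc_vectors)

    batches = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # Split each cluster into batch-sized chunks.
        for start in range(0, len(members), batch_size):
            batches.append(members[start:start + batch_size].tolist())
    return batches

# Example: 1,000 documents with 384-dimensional vectors from some initial embedder.
vectors = np.random.default_rng(0).standard_normal((1000, 384)).astype(np.float32)
batches = build_contextual_batches(vectors, batch_size=64)
print(len(batches), "batches; first batch indices:", batches[0][:5])
```

Because clusters rarely divide evenly, a real implementation would also balance cluster sizes; this sketch simply chunks whatever each cluster contains.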

Why it matters?

This research is important because it enhances the ability of machine learning systems to retrieve relevant information from large collections of documents. By improving how documents are understood in relation to one another, Contextual Document Embeddings can lead to better results in applications like search engines, recommendation systems, and any technology that relies on understanding and retrieving textual information.

Abstract

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
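For the second method, the contextual architecture, the sketch below shows one simplified way to realize the idea in PyTorch: a first-stage encoder condenses a handful of neighboring documents into context vectors, and a second-stage encoder attends to those vectors alongside the target document's own tokens. The class name, dimensions, pooling, and backbone are placeholder assumptions, not the authors' released model.

```python
import torch
import torch.nn as nn

class ContextualDocEncoder(nn.Module):
    # Illustrative two-stage encoder: stage 1 embeds neighbor documents into
    # context vectors; stage 2 encodes the target document while attending
    # to those context vectors. All sizes are placeholders.
    def __init__(self, vocab_size: int = 30522, dim: int = 256, n_layers: int = 2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.first_stage = nn.TransformerEncoder(layer, num_layers=n_layers)   # encodes neighbors
        self.second_stage = nn.TransformerEncoder(layer, num_layers=n_layers)  # encodes the document

    def encode_neighbors(self, neighbor_ids: torch.Tensor) -> torch.Tensor:
        # neighbor_ids: (n_neighbors, seq_len) -> one context vector per neighbor
        h = self.first_stage(self.tok_emb(neighbor_ids))
        return h.mean(dim=1)                                         # (n_neighbors, dim)

    def forward(self, doc_ids: torch.Tensor, neighbor_ids: torch.Tensor) -> torch.Tensor:
        context = self.encode_neighbors(neighbor_ids).unsqueeze(0)   # (1, n_neighbors, dim)
        doc_tokens = self.tok_emb(doc_ids)                           # (1, seq_len, dim)
        # Prepend the context vectors so self-attention can mix corpus
        # information into the document's representation.
        h = self.second_stage(torch.cat([context, doc_tokens], dim=1))
        return h.mean(dim=1)                                         # (1, dim) document embedding

model = ContextualDocEncoder()
doc = torch.randint(0, 30522, (1, 32))        # token ids for one document
neighbors = torch.randint(0, 30522, (4, 32))  # token ids for four neighboring documents
print(model(doc, neighbors).shape)            # torch.Size([1, 256])
```

Because the context vectors come from the corpus rather than the document itself, re-encoding the same document against a different collection yields a different, corpus-aware embedding, which is the property the paper argues for.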