Parallel Context-of-Experts Decoding for Retrieval Augmented Generation
Giulio Corallo, Paolo Papotti
2026-01-14
Summary
This paper addresses a challenge with a technique called Retrieval Augmented Generation, which combines information retrieval with large language models to improve responses. It introduces a new method to better use multiple documents without slowing down the process.
What's the problem?
When using Retrieval Augmented Generation, there's a tricky balance. If you feed all the retrieved documents into the language model at once, it can reason across them, but processing becomes very slow because the prompt grows so long. If you process each document separately, it's faster, but the model loses the ability to connect information *between* the documents, hindering its reasoning ability.
What's the solution?
The researchers developed a technique called Parallel Context-of-Experts Decoding, or Pced. Instead of making the model attend to all documents simultaneously, Pced treats each retrieved document as an individual 'expert'. It then applies a decoding rule that compares each expert's next-token predictions against the model's prior (its prediction without any retrieved context), effectively weighing their contributions to the final answer without needing a shared attention mechanism across all documents.
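The paper does not spell out the exact formula here, but the idea of contrasting each expert's logits with the model prior and aggregating the differences can be sketched as follows. This is a minimal illustration, not the authors' actual rule: the expert-weighting scheme (softmax over each expert's strongest shift) and the `alpha` scale are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pced_decode_step(expert_logits, prior_logits, alpha=1.0):
    """One decoding step combining per-document 'experts' (illustrative sketch).

    expert_logits: (n_experts, vocab) -- logits from the model conditioned
                   on one retrieved document each (hypothetical interface).
    prior_logits:  (vocab,) -- logits from the model with no retrieved context.
    """
    # Contrastive signal: how much each expert shifts the prior
    # (assumption: the paper's rule uses some weighted form of this difference).
    contrast = expert_logits - prior_logits
    # Weight each expert by the strength of its strongest shift (illustrative choice).
    weights = softmax(contrast.max(axis=1))
    # Aggregate expert evidence on top of the prior and pick the next token.
    combined = prior_logits + alpha * (weights @ contrast)
    return int(combined.argmax())
```

An expert whose document strongly supports a particular token pulls the combined distribution toward that token, while uninformative documents, whose logits barely differ from the prior, contribute little.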
Why does it matter?
This work is important because it offers a way to get the best of both worlds in Retrieval Augmented Generation: the ability to reason across multiple documents *and* maintain speed. This could lead to more accurate and efficient AI systems that can effectively use large amounts of information.
Abstract
Retrieval Augmented Generation faces a trade-off: concatenating documents into a long prompt enables multi-document reasoning but creates prefill bottlenecks, while encoding document KV caches separately offers speed but breaks cross-document interaction. We propose Parallel Context-of-Experts Decoding (Pced), a training-free framework that shifts evidence aggregation from the attention mechanism to the decoding step. Pced treats retrieved documents as isolated "experts", synchronizing their predictions via a novel retrieval-aware contrastive decoding rule that weighs expert logits against the model prior. This approach recovers cross-document reasoning capabilities without constructing shared attention across documents.