CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang
2025-11-26
Summary
This paper introduces a new method called CLaRa to improve how large language models use information retrieved from external sources to answer questions. It focuses on making the process of finding and using that information more efficient and accurate.
What's the problem?
Large language models keep improving, but they run into trouble when answering questions with lots of retrieved text. First, processing very long contexts is slow and expensive. Second, the step of *finding* relevant information and the step of *using* it to generate an answer are usually optimized separately, even though each one's quality depends on the other. In short, current systems struggle to handle large amounts of retrieved information efficiently, and the retriever never learns what the generator actually needs.
What's the solution?
CLaRa tackles both issues by compressing each retrieved document into compact continuous vectors that preserve its meaning, so the generator reads short embeddings instead of long text. To make these compressed vectors both semantically rich and retrievable, CLaRa uses a data synthesis technique called SCP, which supervises the compression with question-answering and paraphrase signals. It then trains the part of the system that ranks retrieved documents (the reranker) and the part that writes the answer (the generator) *together*, with a single language modeling loss: a differentiable top-k estimator lets gradients flow from the answer back into the retrieval scores, so the system learns to retrieve exactly the information that helps it generate good answers.
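Selecting the top-k documents is normally a hard, non-differentiable operation, so gradients from the answer could not reach the reranker. One common way to relax it (a sketch of the general idea, not necessarily the paper's exact estimator) is a "successive softmax": run a temperature-scaled softmax k times, softly suppressing the item chosen in each round, so the accumulated weights approximate a k-hot selection mask while remaining smooth in the scores.

```python
import numpy as np

def soft_topk(scores, k, tau=0.1):
    """Differentiable relaxation of top-k selection over reranker scores.

    Runs k rounds of a temperature-scaled softmax; after each round the
    (softly) selected item is suppressed by adding log(1 - p) to its logit.
    The accumulated weights sum to k and, as tau -> 0, approach a hard
    k-hot mask over the k highest-scoring items.
    """
    logits = np.asarray(scores, dtype=float).copy()
    weights = np.zeros_like(logits)
    for _ in range(k):
        z = (logits - logits.max()) / tau     # stabilized softmax logits
        p = np.exp(z)
        p /= p.sum()
        weights += p
        # Softly remove the chosen item before the next round.
        logits = logits + np.log(np.maximum(1.0 - p, 1e-30))
    return weights

# Hypothetical reranker scores for 4 compressed documents; soft top-2.
w = soft_topk([3.0, 1.0, 0.1, -2.0], k=2, tau=0.05)
print(np.round(w, 3))  # mass concentrates on the two highest-scoring docs
```

Because the weights are smooth functions of the scores, the generator can consume a weighted mix of the compressed document vectors, and the language modeling loss backpropagates into the reranker.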
Why it matters?
This research is important because it pushes the boundaries of what's possible with retrieval-augmented language models. By improving both the speed and accuracy of these systems, CLaRa can lead to better question-answering, more informative chatbots, and other applications that rely on accessing and understanding large amounts of knowledge. Despite working over compressed representations, it often outperforms baselines fine-tuned directly on the full retrieved text.
Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
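The abstract's claim that a single language modeling loss aligns retrieval relevance with answer quality can be checked on a toy example. Treat the overall loss as the generator's per-document loss averaged under a softmax over reranker scores (a k=1 relaxation of the selection; all numbers here are illustrative). The analytic gradient with respect to each score is w_i (l_i - L), which a finite-difference check confirms, and its sign pushes scores up for documents that lower the generation loss.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def expected_loss(scores, per_doc_loss):
    """Generator loss marginalized over a soft document choice."""
    return float(softmax(scores) @ per_doc_loss)

# Toy setup: 3 candidate documents; document 2 yields the lowest LM loss.
scores = np.array([0.5, 0.1, 0.2])
per_doc_loss = np.array([2.0, 1.5, 0.4])

w = softmax(scores)
L = float(w @ per_doc_loss)
grad = w * (per_doc_loss - L)   # analytic: d/ds_i sum_j w_j l_j

# Finite-difference check that the LM loss really trains the scores.
eps = 1e-6
fd = np.array([
    (expected_loss(scores + eps * np.eye(3)[i], per_doc_loss) - L) / eps
    for i in range(3)
])
print(np.allclose(grad, fd, atol=1e-5))
print(grad[2] < 0)  # the helpful document's score is pushed upward
```

The negative gradient on the best document's score is exactly the alignment the abstract describes: minimizing the single generation loss simultaneously teaches the reranker to prefer documents that make the answer better.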