KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs
Yixuan Tang, Yi Yang
2026-01-06
Summary
This paper introduces KV-Embedding, a training-free technique for getting better text embeddings out of large language models (LLMs) — that is, using them to represent the meaning of text without any additional training.
What's the problem?
Large language models are really good at understanding and generating text, but when you use them 'as is', without training them for a specific task, they have some structural weaknesses. Because of their causal attention, early tokens in a text never get to 'see' the later tokens while the sequence is being processed. They are also trained to predict the *next* word, which biases their internal representations toward generation rather than toward compressing the overall meaning of a passage.
What's the solution?
The researchers found that the final token's 'key-value' (KV) states at each layer of the LLM already encode a pretty good summary of the entire text. KV-Embedding re-routes these states, prepending them as a prefix inside the model so that *every* token — including the earliest ones — can attend to a summary of the full sequence within a single forward pass, all without modifying the frozen LLM itself. They also devised an automated way, based on intrinsic dimensionality, to choose which layers carry the most useful summaries.
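To make the re-routing idea concrete, here is a minimal single-head attention toy in NumPy. It is a sketch of the *mechanism only*, not the paper's implementation: a real LLM would do this with multi-head KV caches at every layer, and the function and variable names below are invented for illustration. The key point is that a prepended key/value slot (here, the final token's own K/V state) is visible to all query positions, so even the first token can attend to a whole-sequence summary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V, prefix_kv=None):
    """Single-head attention with a causal mask.

    If prefix_kv = (k_p, v_p) is given, those states are prepended to
    K and V and made visible to every query position -- a toy version
    of re-routing final-token KV states as a prefix.
    """
    T, d = Q.shape
    if prefix_kv is not None:
        k_p, v_p = prefix_kv                    # (P, d) each
        K = np.concatenate([k_p, K], axis=0)
        V = np.concatenate([v_p, V], axis=0)
    P = K.shape[0] - T                          # number of prefix slots
    scores = Q @ K.T / np.sqrt(d)               # (T, T + P)
    # Causal mask over the original tokens; prefix slots stay visible.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[:, P:][mask] = -np.inf
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

plain = causal_attention(Q, K, V)
# Re-route: the final token's K/V state acts as a sequence summary.
rerouted = causal_attention(Q, K, V, prefix_kv=(K[-1:], V[-1:]))
```

In the plain causal pass, token 0's output depends only on itself; with the re-routed prefix, it also mixes in the final token's value state — the toy analogue of early tokens gaining access to sequence-level context.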
Why it matters?
This work is important because it shows a clever way to get more out of existing LLMs without the expensive and time-consuming process of retraining them. It's a more efficient approach to improving performance, and it opens up possibilities for exploring other ways to manipulate the inner workings of these models to make them even better at understanding and representing information.
Abstract
While LLMs are powerful embedding backbones, their application in training-free settings faces two structural challenges: causal attention restricts early tokens from accessing subsequent context, and the next-token prediction objective biases representations toward generation rather than semantic compression. To address these limitations, we propose KV-Embedding, a framework that activates the latent representation power of frozen LLMs. Our method leverages the observation that the key-value (KV) states of the final token at each layer encode a compressed view of the sequence. By re-routing these states as a prepended prefix, we enable all tokens to access sequence-level context within a single forward pass. To ensure model-agnostic applicability, we introduce an automated layer selection strategy based on intrinsic dimensionality. Evaluations on MTEB across Qwen, Mistral, and Llama backbones show that KV-Embedding outperforms existing training-free baselines by up to 10%, while maintaining robust performance on sequences up to 4,096 tokens. These results demonstrate that internal state manipulation offers an efficient alternative to input modification, and we hope this work encourages further exploration of LLM internals for representation learning.
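The abstract mentions automated layer selection based on intrinsic dimensionality (ID). Below is a hedged sketch of what such a selector could look like, using the TwoNN estimator of Facco et al. (2017) over per-layer final-token states. Everything here is an assumption for illustration: the paper's actual ID estimator, selection criterion (e.g. whether low or high ID is preferred), and `k` may differ, and `twonn_id` / `select_layers` are invented names.

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate (Facco et al., 2017):
    d = N / sum(log(r2 / r1)) over each point's two nearest-neighbour
    distances r1 <= r2. X has shape (n_points, n_features)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)     # exclude self-distances
    D.sort(axis=1)
    r1, r2 = D[:, 0], D[:, 1]
    return len(X) / np.log(r2 / r1).sum()

def select_layers(layer_states, k=3):
    """Rank layers by the intrinsic dimensionality of their final-token
    states and keep the k lowest-ID layers. Preferring LOW ID (as a
    proxy for 'compressed' representations) is an assumption here."""
    ids = [twonn_id(h) for h in layer_states]
    chosen = np.argsort(ids)[:k]
    return sorted(chosen.tolist()), ids

# Toy "hidden states": 8 layers of final-token vectors for 50 texts.
rng = np.random.default_rng(0)
layer_states = [rng.standard_normal((50, 16)) for _ in range(8)]
chosen, ids = select_layers(layer_states, k=3)
```

As a sanity check of the estimator, points drawn from a 2-D subspace embedded in a higher-dimensional space should yield an ID estimate near 2, regardless of the ambient dimension.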