DoPE: Denoising Rotary Position Embedding
Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, Ngai Wong
2025-11-17
Summary
This paper focuses on improving how Transformer models, the backbone of many large language models, handle very long pieces of text. It introduces a new technique called Denoising Positional Encoding (DoPE) to help these models understand the relationships between words even when dealing with extremely long sequences.
What's the problem?
Transformer models use something called 'positional encoding' to understand the order of words in a sentence. A common method, RoPE, struggles when the text gets very long, leading to poor performance when trying to understand or generate text beyond the lengths it was originally trained on. Essentially, the model gets confused about which words are important when the context is huge, and this confusion gets worse as the text gets longer. This happens because the way the model pays attention to different words becomes unbalanced, creating what's called an 'attention sink'.
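As a toy illustration of an attention sink (the numbers and names here are our own, not the paper's), a single outlier logit is enough to make softmax attention collapse onto one token, starving the genuinely relevant tokens of attention mass:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of attention logits.
    e = np.exp(x - x.max())
    return e / e.sum()

context_len = 128
logits = np.zeros(context_len)
logits[0] = 6.0                 # hypothetical outlier "sink" logit at position 0
weights = softmax(logits)

sink_share = weights[0]         # most of the attention mass lands on the sink
rest_share = weights[1:].max()  # every other token gets almost nothing
```

Here one position absorbs well over half of the attention mass, while each remaining token receives a fraction of a percent; this imbalance is the kind of pattern the paper refers to as an attention sink.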
What's the solution?
The researchers reinterpret the attention map produced with positional encoding as a 'noisy' feature map. DoPE reduces this noise *without* retraining the model: it uses a measure called truncated matrix entropy to detect the outlier frequency bands that are causing problems, filters them out, and reparameterizes them with a parameter-free Gaussian distribution, making the model's attention more reliable even with very long texts. The paper also shows mathematically why the attention sink happens and how this denoising fixes it.
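A minimal sketch of the idea in a simplified setting: the function names, the truncation level `k=8`, the entropy threshold, and the synthetic "bands" below are our illustrative choices, not the paper's actual detection or reparameterization rules. A band whose feature map has collapsed into a few directions gets low truncated matrix entropy and is replaced by a parameter-free Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_matrix_entropy(feat, k=8):
    # Entropy of the top-k normalized singular values of a feature map:
    # low entropy means the energy has collapsed into a few directions
    # (a noisy, sink-prone band); high entropy means balanced structure.
    s = np.linalg.svd(feat, compute_uv=False)[:k]
    p = s / s.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def denoise_bands(bands, thresh=1.0):
    # Bands flagged as outliers are reparameterized with a parameter-free
    # (zero-mean, unit-variance) Gaussian; the threshold is a toy choice.
    return [rng.standard_normal(b.shape) if truncated_matrix_entropy(b) < thresh
            else b for b in bands]

balanced = rng.standard_normal((64, 64))                    # healthy band
collapsed = np.outer(np.ones(64), rng.standard_normal(64))  # rank-1, noisy band
cleaned = denoise_bands([balanced, collapsed])              # keeps the first, resamples the second
```

The singular-value entropy acts as a cheap, training-free detector: no parameters are learned, which matches the paper's claim that the method can be applied to a frozen model.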
Why it matters?
This work is important because it offers a simple and effective way to significantly improve the ability of large language models to process and understand very long documents. This is crucial for tasks like summarizing long articles, answering questions about extensive texts, or having coherent conversations over many turns. By improving 'length generalization', DoPE makes these models more practical and useful for real-world applications that require handling large amounts of information.
Abstract
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Project page: https://The-physical-picture-of-LLMs.github.io