Revisiting Long-context Modeling from Context Denoising Perspective
Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang
2025-10-09
Summary
This paper investigates a weakness in large language models (LLMs) that are designed to handle very long pieces of text, and proposes a way to make them more reliable.
What's the problem?
While LLMs are getting better at processing long texts, they often get distracted by irrelevant information within that text – think of it like trying to focus on a key sentence in a long document filled with unnecessary details. This 'noise' in the context can throw off the model's attention and lead to incorrect predictions. The paper specifically looks at *how* this noise affects the model's decision-making process.
What's the solution?
The researchers developed a method to identify and measure this 'context noise' using a metric called the Integrated Gradient (IG) score. Essentially, this score highlights which parts of the long text actually matter for the prediction versus which parts are just distracting. They then created a training technique called Context Denoising Training (CDT), which teaches the model to pay more attention to the important parts and ignore the noise during training. This helps the model learn to focus on what truly matters.
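The paper's exact IG formulation isn't reproduced in this summary, but the general Integrated Gradients attribution it builds on can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy linear "model", the weight vector `w`, the zero baseline (tokens removed), and the noise threshold are all hypothetical choices for demonstration. IG attributes the prediction to each input dimension by averaging gradients along a straight path from a baseline to the input:

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate Integrated Gradients attributions for input x.

    IG_i = (x_i - b_i) * integral over a in [0, 1] of
           dF/dx_i(b + a * (x - b)) da,
    approximated here with a midpoint Riemann sum over `steps` points.
    """
    alphas = (np.arange(steps) + 0.5) / steps
    total_grad = np.zeros_like(x, dtype=float)
    for a in alphas:
        point = baseline + a * (x - baseline)
        total_grad += f_grad(point)          # gradient at an interpolated input
    avg_grad = total_grad / steps
    return (x - baseline) * avg_grad

# Toy linear "model": each coordinate stands in for one context token.
w = np.array([0.9, 0.05, 0.8, 0.01])         # hypothetical token importances
f_grad = lambda x: w                         # gradient of w @ x is just w

x = np.ones(4)          # "context" with all four tokens present
baseline = np.zeros(4)  # baseline: tokens absent
scores = integrated_gradients(f_grad, x, baseline)

# Tokens with attribution far below the maximum are candidate context noise.
noise_mask = scores < 0.1 * scores.max()
```

For this linear toy model the attributions equal `(x - baseline) * w` exactly, so the two low-weight "tokens" are flagged as noise; in the paper, low-IG tokens within the long context play an analogous role during denoising.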
Why does it matter?
This research is important because it shows a practical way to improve the performance of LLMs on tasks that require processing long texts. The CDT training method is relatively simple to implement, yet it allows a smaller, open-source 8B model to score 50.92, nearly matching the much larger GPT-4o (51.00), making advanced long-context processing more accessible.
Abstract
Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens, that can mislead model attention. In this paper, we conduct a fine-grained analysis of the context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify the noise information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model's attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model can achieve performance (50.92) comparable to GPT-4o (51.00).