
Sliding Window Attention Adaptation

Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei

2025-12-15

Summary

This paper investigates how to make large language models, which are good at understanding and generating text, process very long pieces of text efficiently without losing their ability to perform well on them.

What's the problem?

Large language models use a technique called 'attention' to let every part of the input text consider every other part. The cost of doing this grows quadratically with the length of the input, so the process becomes extremely slow and resource-intensive on long texts. A faster method called 'sliding window attention' exists, in which each token only attends to a limited window of recent tokens, keeping the cost roughly linear. However, simply switching to it after a model has been trained with the original full-attention method causes the model to perform much worse on long texts – it's like changing the rules mid-game.
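To make the scaling difference concrete, here is a minimal PyTorch sketch (not taken from the paper) that builds the two kinds of causal masks: full attention scores every past token, while a sliding window only scores the most recent few.

```python
# Illustrative sketch (not the paper's code): causal attention masks contrasting
# full attention with sliding window attention.
import torch

def full_causal_mask(seq_len: int) -> torch.Tensor:
    # Every query attends to all previous tokens -> O(n^2) scored pairs.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Each query attends only to the last `window` tokens (including itself),
    # so the number of scored pairs grows linearly with sequence length.
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)  # query index minus key index
    return (dist >= 0) & (dist < window)

print(full_causal_mask(6).int())
print(sliding_window_mask(6, window=3).int())
```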

What's the solution?

The researchers developed a method called 'Sliding Window Attention Adaptation' (SWAA). This isn't one single trick, but a combination of five techniques: applying the faster 'sliding window' method only during prefilling (the initial pass over the prompt), preserving special 'sink' tokens at the start of the sequence that the model relies on heavily, interleaving layers that use the original full attention with layers that use sliding window attention, using the 'chain-of-thought' prompting technique to help the model reason, and finally fine-tuning the model so it adjusts to the new attention pattern. They experimented with different combinations of these techniques and found that no single one suffices on its own, but certain combinations recover the original long-context performance.
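The following is a hedged Python sketch of two of these ingredients: a sliding-window mask that always keeps a few 'sink' tokens visible, applied over the whole prompt as during prefilling. The function names, window size, and number of sink tokens here are illustrative assumptions, not values taken from the paper or its repository.

```python
# Hedged sketch of two SWAA ingredients: a sliding-window mask that keeps the
# first few "sink" tokens attendable, used over the prompt (the prefill stage).
# Names and sizes are illustrative, not from the SWAA repository.
import torch

def swa_mask_with_sinks(seq_len: int, window: int, num_sinks: int = 4) -> torch.Tensor:
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)
    causal = dist >= 0
    in_window = dist < window
    is_sink_key = (pos < num_sinks).unsqueeze(0)  # first tokens stay visible to all later queries
    return causal & (in_window | is_sink_key)

def attend(q, k, v, mask):
    # Plain scaled dot-product attention with a boolean mask.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Prefill a 16-token prompt with the sliding-window + sink mask; subsequent
# decoding would then proceed token by token over the cached keys and values.
seq_len, dim, window = 16, 8, 6
q = k = v = torch.randn(seq_len, dim)
prefill_out = attend(q, k, v, swa_mask_with_sinks(seq_len, window))
```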

Why it matters?

This work is important because it shows that you *can* adapt existing, powerful language models to handle much longer texts efficiently without needing to retrain them from scratch. This opens the door to using these models for tasks that require processing large amounts of information, like analyzing entire books or long conversations, making them more practical and accessible for a wider range of applications.

Abstract

The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
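As a rough illustration of the layer-interleaving recipe mentioned in the abstract, the snippet below assigns each transformer layer either full attention (FA) or sliding window attention (SWA). The one-FA-layer-in-four pattern is an arbitrary assumption for illustration, not the paper's recommended configuration.

```python
# Minimal sketch of interleaving FA and SWA layers: label each transformer layer
# with the attention type it should use. The 1-in-4 FA pattern is illustrative only.
def make_layer_pattern(num_layers: int, fa_every: int = 4) -> list[str]:
    return ["FA" if i % fa_every == 0 else "SWA" for i in range(num_layers)]

print(make_layer_pattern(12))
# ['FA', 'SWA', 'SWA', 'SWA', 'FA', 'SWA', 'SWA', 'SWA', 'FA', 'SWA', 'SWA', 'SWA']
```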