IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li
2026-03-13
Summary
This paper introduces IndexCache, a technique that makes large language models using sparse attention run faster and cheaper without losing quality.
What's the problem?
Large language models are getting really good at handling long pieces of text, but this requires a lot of computing power. A common technique to improve speed, sparse attention, involves focusing on only the most important parts of the text. However, a key part of sparse attention, the 'indexer', which finds those important parts, is still slow and repeats the same calculations over and over again in different layers of the model, even though the results are often very similar between layers.
What's the solution?
IndexCache solves this by strategically choosing a few layers to do the full indexing work, and then having the other layers simply reuse those results. Think of it like having a few dedicated researchers who do all the heavy lifting, and then everyone else uses their findings. The researchers are chosen carefully, and the paper explores two ways to do this: one that quickly tests different layer combinations without changing the model itself, and another that lightly fine-tunes the chosen layers so their indexing serves everyone else even more accurately.
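The Full/Shared layer split described above can be sketched in a few lines of toy Python. This is an illustrative sketch, not the paper's implementation; it assumes the first layer is always a Full layer so every Shared layer can reuse indices from a preceding one, and all function names are hypothetical:

```python
def topk_indices(scores, k):
    """Indices of the k largest scores (the indexer's output for one query)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def build_layer_map(num_layers, full_layers):
    """Map each layer to the nearest preceding Full layer whose indices it reuses.

    Assumes layer 0 is a Full layer so every layer has a source to reuse from.
    """
    full = sorted(full_layers)
    assert full[0] == 0, "sketch assumes the first layer runs its own indexer"
    return {layer: max(f for f in full if f <= layer)
            for layer in range(num_layers)}

def run_with_index_cache(scores_per_layer, layer_map, k):
    """Full layers compute top-k indices; Shared layers reuse cached ones."""
    cache, selections = {}, {}
    for layer in sorted(layer_map):
        src = layer_map[layer]
        if src == layer:
            # Full layer: run its own indexer and cache the result
            cache[layer] = topk_indices(scores_per_layer[layer], k)
        # Shared layers skip the indexer entirely and read the cache
        selections[layer] = cache[src]
    return selections
```

For example, with two layers where only layer 0 is Full, layer 1 attends to the tokens layer 0's indexer picked, even if its own indexer scores would have differed. The speedup comes from never computing those redundant scores at Shared layers.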
Why it matters?
This work is important because it significantly speeds up large language models – up to 1.82 times faster for initial processing and 1.48 times faster for generating text – while maintaining the same level of accuracy. This means these models can be used more efficiently, reducing costs and making them more accessible for a wider range of applications, even on very large models like GLM-5.
Abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L^2) to O(Lk). However, the indexer itself retains O(L^2) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers retain their indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
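The training-free greedy search described in the abstract can be sketched roughly as follows. Here `eval_loss` stands in for measuring language-modeling loss on a calibration set given a candidate set of retained indexers; the function names and structure are assumptions for illustration, not the paper's code:

```python
def greedy_select(num_layers, budget, eval_loss):
    """Greedily pick which layers keep their indexers (Full layers).

    Each round adds the single layer whose retained indexer most reduces
    the calibration loss, until `budget` Full layers are selected.
    """
    selected = set()
    for _ in range(budget):
        best_layer, best_loss = None, float("inf")
        for layer in range(num_layers):
            if layer in selected:
                continue
            # Evaluate calibration loss with this layer's indexer also retained
            loss = eval_loss(selected | {layer})
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        selected.add(best_layer)
    return selected
```

With a budget of 25% of layers, this matches the paper's reported setting of removing 75% of indexer computations; the search itself requires only forward passes on calibration data, no weight updates.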