
Attention Is All You Need for KV Cache in Diffusion LLMs

Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen

2025-10-17

Summary

This paper focuses on making diffusion large language models, a class of powerful AI models for generating text and code, faster and more efficient without sacrificing quality.

What's the problem?

Current diffusion language models spend a lot of time recomputing intermediate information (called key-value caches) at every denoising step, even when that information has barely changed, especially in the shallow layers of the model. This wastes computing power and slows down how quickly the model can generate responses. It's like re-reading the same parts of a book over and over again when you already understand them.

What's the solution?

The researchers developed a technique called Elastic-Cache. This method smartly decides *when* and *where* to update these key-value caches. It notices that information from earlier parts of the text changes less often and can be reused. It also focuses cache updates on the deeper, more dynamic layers of the model, and uses how much attention a token receives to predict how much its cache might have changed. This avoids unnecessary recalculations.
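The when/where decision described above can be sketched as two small functions. This is a simplified illustration, not the paper's implementation: the threshold `tau`, the relative-drift measure, and the function names are assumptions chosen for clarity.

```python
import numpy as np

def should_refresh(attn_weights, kv_old, kv_new, tau=0.02):
    """Attention-aware drift test (simplified sketch).

    attn_weights: (seq,) attention each cached token receives
    kv_old, kv_new: (seq, dim) cached vs. freshly computed KV states
    tau: drift tolerance (hypothetical value)

    The most-attended token is observed to have the *smallest* KV drift,
    so if even its drift exceeds tau, other tokens have likely drifted
    more and the caches should be refreshed.
    """
    star = int(np.argmax(attn_weights))  # most-attended token
    drift = np.linalg.norm(kv_new[star] - kv_old[star]) / (
        np.linalg.norm(kv_old[star]) + 1e-8)  # relative drift
    return drift > tau

def layers_to_refresh(num_layers, start_layer, refresh):
    """Depth-aware schedule (sketch): recompute KV only from
    `start_layer` onward, reusing shallow-layer caches."""
    if not refresh:
        return []
    return list(range(start_layer, num_layers))
```

In this sketch, a cheap scalar test on one token gates the expensive recomputation, and even when a refresh fires, only the deeper layers (where KV states actually drift) are recomputed.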

Why it matters?

This work is important because it significantly speeds up diffusion language models, up to about 45 times faster on long sequences, without making them any less accurate. This makes these powerful models more practical for real-world applications like solving math problems or writing code, as it reduces the time and resources needed to get a response.

Abstract

This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides *when* to refresh (via an attention-aware drift test on the most-attended token) and *where* to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: 8.7× on GSM8K (256 tokens), 45.1× on longer sequences, and 4.8× on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput (6.8× on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.