Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
2025-08-05
Summary
This paper introduces Sparse-dLLM, a method that makes large diffusion-based language models run faster and more efficiently by evicting unneeded entries from memory and focusing attention only where it is needed.
What's the problem?
Large diffusion language models demand substantial memory and compute, which slows inference and makes them hard to run efficiently, especially on longer tasks.
What's the solution?
Sparse-dLLM addresses this with dynamic cache eviction, which discards low-utility entries from the cache while the model is running, combined with sparse attention, which restricts each step to the most relevant tokens. Together these boost speed and memory efficiency without sacrificing output quality.
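This summary does not spell out the paper's exact eviction criterion, but the core idea of attention-guided cache eviction can be sketched as follows. This is a minimal illustration, assuming entries are ranked by the attention mass they recently received; the function name `evict_low_utility` and the fixed-budget policy are illustrative, not the paper's actual implementation.

```python
import numpy as np

def evict_low_utility(cache_keys, cache_values, attn_scores, budget):
    """Keep only the `budget` cache entries with the highest attention scores.

    cache_keys, cache_values: (n, d) arrays of cached key/value vectors.
    attn_scores: (n,) attention mass each cached token recently received.
    NOTE: a hypothetical sketch, not Sparse-dLLM's actual eviction rule.
    """
    if cache_keys.shape[0] <= budget:
        return cache_keys, cache_values
    # Indices of the top-`budget` entries by attention score.
    keep = np.argsort(attn_scores)[-budget:]
    keep.sort()  # preserve the original token order
    return cache_keys[keep], cache_values[keep]

# Toy usage: 6 cached tokens, cache budget of 3.
keys = np.arange(12, dtype=float).reshape(6, 2)
vals = keys * 10
scores = np.array([0.05, 0.40, 0.10, 0.30, 0.02, 0.13])
k, v = evict_low_utility(keys, vals, scores, budget=3)
# Tokens 1, 3, and 5 (the highest-scoring) survive eviction.
```

The design choice here is a simple top-k filter: the cache shrinks to a fixed budget, and the surviving entries keep their original order so positional structure is preserved.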
Why does it matter?
This matters because it lets powerful diffusion language models run faster and handle longer tasks efficiently, making AI tools built on them more practical in real-world applications.
Abstract
Sparse-dLLM improves the efficiency of diffusion large language models by implementing dynamic cache eviction and sparse attention, enhancing throughput without compromising performance.