LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation
Yang Xiao, Gen Li, Kaiyuan Deng, Yushu Wu, Zheng Zhan, Yanzhi Wang, Xiaolong Ma, Bo Hui
2025-10-08
Summary
This paper focuses on making video generation with diffusion models faster without retraining the model, which is a big practical advantage. Diffusion models are powerful video generators, but they are slow and require a lot of GPU memory.
What's the problem?
When the generation process is sped up with caching techniques that reuse previously computed features, memory usage surges during the later stages of video creation, specifically denoising and decoding. This memory surge limits how aggressively the process can be accelerated, and can even crash the program with an out-of-memory error.
What's the solution?
The researchers broke the video generation pipeline into three stages: encoding, denoising, and decoding, and developed a memory-reduction method tailored to each. During denoising, asynchronous cache swapping moves cached features off the GPU and fetches them back just before they are needed; features are also broken into smaller chunks; and during decoding, the latents are sliced so only part of the video is decoded at a time. Importantly, the overhead these methods introduce stays smaller than the speedup gained from caching, so the acceleration is preserved.
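The asynchronous swapping idea can be sketched as follows. This is a minimal illustration, not the paper's code: the `FeatureCache` class and its methods are hypothetical, and `np.copy` stands in for a CPU-GPU transfer. Cached features are swapped out to a host-side store, and a background worker swaps in the feature needed by the next denoising step while the current step is still computing, so the blocking wait in `get` is hidden behind useful work.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

class FeatureCache:
    """Hypothetical sketch of asynchronous cache swapping (not the paper's API).
    np.copy stands in for a host<->device transfer."""

    def __init__(self):
        self.store = {}                       # step -> swapped-out feature
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = None                   # (step, future) of in-flight swap-in

    def put(self, step, feat):
        # Swap out: keep the feature on the host so it frees device memory.
        self.store[step] = np.copy(feat)

    def prefetch(self, step):
        # Start swapping the feature back in while the current step computes.
        if step in self.store:
            self.pending = (step, self.pool.submit(np.copy, self.store[step]))

    def get(self, step):
        # Blocks only if the asynchronous swap-in has not finished yet.
        if self.pending and self.pending[0] == step:
            _, fut = self.pending
            self.pending = None
            return fut.result()
        return np.copy(self.store[step])

cache = FeatureCache()
feat = np.random.rand(4, 8, 8).astype(np.float32)
cache.put(0, feat)
cache.prefetch(0)                             # overlaps compute in a real pipeline
assert np.array_equal(cache.get(0), feat)
print("ok")
```

In a real pipeline the transfer would use a separate CUDA stream and pinned memory so the copy genuinely overlaps the denoising kernels; the thread pool here only models the control flow.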
Why it matters?
This work is important because it allows for faster and more efficient video generation using diffusion models. By reducing the memory requirements, it makes it possible to create higher-quality videos more quickly, and potentially on less powerful hardware. This opens up possibilities for wider use of these advanced video generation techniques.
Abstract
Training-free acceleration has emerged as an advanced research area in video generation based on diffusion models. The redundancy of latents in diffusion model inference provides a natural entry point for acceleration. In this paper, we decompose the inference process into the encoding, denoising, and decoding stages, and observe that cache-based acceleration methods often lead to substantial memory surges in the latter two stages. To address this problem, we analyze the characteristics of inference across different stages and propose stage-specific strategies for reducing memory consumption: 1) asynchronous cache swapping; 2) feature chunking; 3) slicing latents to decode. At the same time, we ensure that the time overhead introduced by these three strategies remains lower than the acceleration gains themselves. Compared with the baseline, our approach achieves faster inference speed and lower memory usage, while maintaining quality degradation within an acceptable range. The code is available at https://github.com/NKUShaw/LightCache .
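The third strategy, slicing latents to decode, can be sketched as below. This is an illustrative example, not the paper's implementation: `decode_sliced` and the toy `decode` function are hypothetical names, with a 2x spatial upsampler standing in for the VAE decoder. Decoding the latent frames in small slices keeps the decoder's peak memory proportional to the slice size rather than the full video length.

```python
import numpy as np

def decode_sliced(latents, decode_fn, slice_size=4):
    """Decode latent frames slice by slice along the frame axis and
    concatenate the results. `decode_fn` is a placeholder for the VAE
    decoder; peak memory scales with `slice_size`, not the video length."""
    outs = []
    for i in range(0, latents.shape[0], slice_size):
        outs.append(decode_fn(latents[i:i + slice_size]))
    return np.concatenate(outs, axis=0)

# Toy decoder: upsample each latent frame 2x spatially.
decode = lambda z: np.repeat(np.repeat(z, 2, axis=-1), 2, axis=-2)

z = np.random.rand(10, 4, 8, 8).astype(np.float32)   # (frames, C, H, W)
video = decode_sliced(z, decode, slice_size=4)
print(video.shape)  # (10, 4, 16, 16)
```

Note that sliced decoding reproduces the full decode exactly only when the decoder treats frames independently, as the toy decoder does; for decoders with temporal layers, slices need overlap or blending at the boundaries.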