Adaptive Caching for Faster Video Generation with Diffusion Transformers

Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Michael S. Ryoo, Tian Xie

2024-11-05

Summary

This paper introduces Adaptive Caching (AdaCache), a new method to speed up video generation using diffusion transformers. It focuses on improving the efficiency of generating high-quality videos while maintaining their temporal consistency across frames.

What's the problem?

Generating high-quality videos can take a lot of time and computing power, especially when the videos are long. Current models such as diffusion transformers are slow at inference because they rely on large networks and heavy attention mechanisms applied over many denoising steps, making it difficult to produce videos quickly without sacrificing quality.

What's the solution?

The authors propose AdaCache, which speeds up video generation by caching intermediate computations from the diffusion process so they can be reused across denoising steps instead of being recomputed. They recognize that not all videos need the same amount of processing; some reach good quality with fewer denoising steps than others. AdaCache therefore uses a caching schedule tailored to each video to balance quality and speed. Additionally, they introduce a technique called Motion Regularization (MoReg) that steers the model's compute allocation based on how much motion is in the video, so that fast-moving, more complex scenes are recomputed more often. A rough sketch of this idea is shown below.
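To make the caching idea concrete, here is a minimal PyTorch sketch of how a transformer block could reuse a cached residual when its features change little between denoising steps, with a motion score shrinking the reuse threshold so high-motion videos are recomputed more often. This is an illustrative approximation, not the paper's implementation; names such as adacache_step, change_threshold, and motion_weight are assumptions made for the example.

    # Illustrative sketch of adaptive residual caching with a motion regularizer.
    # Not the authors' code; thresholds and names are hypothetical.
    import torch

    def feature_distance(prev: torch.Tensor, curr: torch.Tensor) -> float:
        # How much the block's residual changed between two denoising steps.
        return (prev - curr).abs().mean().item()

    def motion_score(latents: torch.Tensor) -> float:
        # Crude motion estimate: mean frame-to-frame difference of video latents
        # with shape [frames, channels, height, width].
        return (latents[1:] - latents[:-1]).abs().mean().item()

    def adacache_step(block, hidden, cache, change_threshold=0.05,
                      motion_weight=1.0, latents=None):
        # Reuse the cached residual if the generation is changing slowly;
        # otherwise run the block and refresh the cache.
        if cache.get("residual") is not None:
            # Motion regularization: more motion -> smaller effective threshold
            # -> the block is recomputed more often.
            reg = 1.0 + motion_weight * (motion_score(latents) if latents is not None else 0.0)
            if cache["distance"] < change_threshold / reg:
                return hidden + cache["residual"]  # skip computation, reuse cache
        out = block(hidden)
        residual = out - hidden
        if cache.get("residual") is not None:
            cache["distance"] = feature_distance(cache["residual"], residual)
        else:
            cache["distance"] = float("inf")  # force recomputation until we have a baseline
        cache["residual"] = residual
        return out

    # Usage sketch with dummy tensors standing in for one DiT block.
    cache = {}
    block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
    hidden = torch.randn(1, 16, 64)       # token features
    latents = torch.randn(8, 4, 32, 32)   # 8-frame video latents
    with torch.no_grad():
        for _ in range(10):
            hidden = adacache_step(block, hidden, cache, latents=latents)

In the paper, such reuse decisions are organized into a caching schedule tailored to each video generation, rather than the fixed per-step check used in this toy example.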

Why it matters?

This research is important because it addresses a major challenge in video generation technology. By making the process faster without losing quality, AdaCache can improve applications in entertainment, virtual reality, and any field where high-quality video content is needed. This advancement could lead to more efficient tools for creators and developers, allowing them to produce better videos in less time.

Abstract

Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.