SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana

2024-11-19

Summary

This paper introduces SmoothCache, a new technique that speeds up inference in Diffusion Transformers (DiT) by caching layer outputs and reusing them at later diffusion timesteps instead of recomputing them.

What's the problem?

Diffusion Transformers are powerful models for generating images, videos, and audio, but they demand substantial compute and time because every denoising step re-evaluates the model's expensive attention and feed-forward layers. Repeating this over the many timesteps of the diffusion process makes inference slow and costly.

What's the solution?

SmoothCache addresses this problem by recognizing that a given layer's output changes very little between adjacent diffusion timesteps. It caches these outputs so they can be reused at the next step instead of being recomputed. By analyzing a small calibration set, SmoothCache determines which features can safely be cached and for how long, yielding speedups of 8% to 71% while maintaining or even improving the quality of the generated content. The method works across different DiT models without model-specific tuning or additional training; a sketch of the idea appears below.
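To make the caching idea concrete, here is a minimal sketch of timestep-level output reuse, assuming a PyTorch-style DiT block. The class name TimestepCachedBlock and the boolean cache_schedule are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class TimestepCachedBlock(nn.Module):
    """Illustrative wrapper: reuses a block's cached output at timesteps
    where a precomputed schedule says the change from the previous
    timestep is small enough to skip recomputation."""

    def __init__(self, block: nn.Module, cache_schedule: list[bool]):
        super().__init__()
        self.block = block
        # cache_schedule[t] is True when the cached output may be reused at step t.
        self.cache_schedule = cache_schedule
        self._cached = None

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        if self.cache_schedule[t] and self._cached is not None:
            # Skip the expensive attention/feed-forward evaluation.
            return self._cached
        out = self.block(x)   # full evaluation
        self._cached = out    # store for possible reuse at the next step
        return out
```

The schedule that decides which steps may reuse the cache would come from the calibration analysis described in the abstract below.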

Why it matters?

This research is important because it makes it easier and faster to use advanced AI models for generating multimedia content. By improving the efficiency of these models, SmoothCache can enable real-time applications, making powerful AI tools more accessible and practical for everyday use in areas like video editing, game development, and more.

Abstract

Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.
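As a companion to the sketch above, this is one way the calibration step could look: record a layer's outputs at consecutive timesteps on a small calibration set, then mark steps whose output barely changes as reusable. The function build_cache_schedule, the L2 relative-error metric, and the 0.05 threshold are assumptions for illustration; the paper's exact error analysis may differ.

```python
import torch

def build_cache_schedule(layer_outputs: list[torch.Tensor],
                         threshold: float = 0.05) -> list[bool]:
    """Given one layer's outputs at consecutive diffusion timesteps
    (collected on a small calibration set), flag timesteps whose output
    differs little from the previous step as safe to serve from cache."""
    schedule = [False]  # always compute the first timestep in full
    for prev, cur in zip(layer_outputs, layer_outputs[1:]):
        rel_err = (cur - prev).norm() / prev.norm()
        schedule.append(rel_err.item() < threshold)
    return schedule
```

A lower threshold trades speed for fidelity: fewer steps are cached, but the reused outputs stay closer to what full evaluation would produce.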