Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun
2025-02-14

Summary
This paper introduces Skrr, a method that makes text-to-image AI models more memory efficient without affecting the quality of the images they create.
What's the problem?
Text-to-image AI models are really good at turning written descriptions into pictures, but the part that understands the text (the text encoder) uses a huge amount of memory, up to eight times more than the part that actually makes the image. This makes these models hard to run on devices with limited memory.
What's the solution?
The researchers created Skrr, which stands for Skip and Re-use layers. The method looks at the text-understanding part of the AI and figures out which of its layers are repetitive or unnecessary. It then either skips those layers entirely or re-uses earlier layers in their place, kind of like taking a shortcut. This saves a lot of memory without making the final images any worse.
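To make the idea concrete, here is a minimal, hypothetical sketch of what a skip-and-reuse text encoder could look like in PyTorch. The class name, the layer-plan format, and the choice to re-apply a kept block's weights in place of a pruned layer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SkipReuseTextEncoder(nn.Module):
    """Hypothetical sketch: a transformer text encoder whose pruned layers
    are either skipped (identity) or replaced by re-applying a kept block."""

    def __init__(self, kept_blocks, layer_plan):
        super().__init__()
        # Only the kept blocks are stored, which is where the memory saving comes from.
        self.kept_blocks = nn.ModuleList(kept_blocks)
        # layer_plan has one entry per original layer, e.g.
        #   ("run", 0)     -> apply kept block 0
        #   ("skip", None) -> pass hidden states through unchanged
        #   ("reuse", 0)   -> re-apply kept block 0 in place of a pruned layer
        self.layer_plan = layer_plan

    def forward(self, hidden_states):
        for action, idx in self.layer_plan:
            if action in ("run", "reuse"):
                hidden_states = self.kept_blocks[idx](hidden_states)
            # "skip": identity, nothing to compute and no weights to store
        return hidden_states

# Illustrative usage: 9 original layers served by 6 kept blocks.
kept = [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
        for _ in range(6)]
plan = (
    [("run", i) for i in range(3)]
    + [("skip", None), ("reuse", 2)]
    + [("run", i) for i in range(3, 6)]
    + [("skip", None)]
)
encoder = SkipReuseTextEncoder(kept, plan)
text_embeddings = encoder(torch.randn(1, 77, 768))  # (batch, tokens, dim)
```

Because only the kept blocks hold parameters, the encoder's memory footprint shrinks roughly in proportion to how many layers are skipped or re-used, while the forward pass still produces embeddings of the same shape.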
Why it matters?
This matters because it could make text-to-image AI more accessible, allowing it to run on smartphones or less powerful computers. It also means these AIs could handle longer or more complex text descriptions without needing super expensive hardware. By making these models more efficient, Skrr could lead to wider use of text-to-image technology in apps, websites, and creative tools.
Abstract
Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
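The abstract does not spell out the exact redundancy measure, but a simple blockwise criterion illustrates the kind of signal such a pruning strategy can rely on: skip one block at a time and check how little the encoder's output changes. The function name, the cosine-distance score, and the mean pooling below are assumptions for illustration, not necessarily the metric Skrr uses.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def block_sensitivity(blocks, hidden_states):
    """For each transformer block, measure how much the final encoder output
    changes (cosine distance of pooled embeddings) when that block alone is
    skipped. Blocks with the lowest scores are the most redundant candidates
    for skipping or re-use."""
    def run(skip_idx=None):
        h = hidden_states
        for i, block in enumerate(blocks):
            if i == skip_idx:
                continue  # identity in place of this block
            h = block(h)
        return h

    reference = run()
    scores = []
    for i in range(len(blocks)):
        pruned = run(skip_idx=i)
        # 1 - cosine similarity of the mean-pooled outputs: lower = more redundant
        score = 1 - F.cosine_similarity(
            reference.mean(dim=1), pruned.mean(dim=1), dim=-1
        ).mean().item()
        scores.append(score)
    return scores
```

Given such scores and a target sparsity, the lowest-scoring blocks would be pruned first, with each pruned position either skipped or mapped to a retained block for re-use.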