
Precise Parameter Localization for Textual Generation in Diffusion Models

Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic

2025-02-17

Summary

This paper presents a new way to make AI image generators better at creating and controlling the text that appears inside images. The researchers found that only a tiny fraction of these AI models is actually responsible for generating text in images.

What's the problem?

AI models that create images with text in them are really big and use a lot of computing power. It's hard to improve how they handle text without messing up the whole system or using even more resources.

What's the solution?

The researchers discovered that less than 1% of the AI model's parameters, all located in attention layers, control how text appears in images. Building on this, they fine-tuned only these specific layers to make text generation better, created ways to edit text in already-generated images, and even found a way to stop the AI from generating harmful text without changing how it works overall.
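The "fine-tune only the localized layers" idea can be sketched in code: freeze the whole model, then attach small trainable LoRA adapters only to the attention projections. This is a minimal toy sketch in PyTorch, not the paper's implementation; the `ToyBlock` model, the `cross_attn` naming convention, and the rank/alpha values are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class LoRALinear(nn.Module):
    """A frozen Linear layer plus a small low-rank (LoRA) trainable update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Down/up factors; the up factor starts at zero, so the wrapped
        # layer initially behaves exactly like the frozen original.
        self.lora_down = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_up = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_down.T @ self.lora_up.T)

class ToyBlock(nn.Module):
    """Stand-in for one backbone block: an MLP plus a 'cross-attention'
    projection (names here are purely illustrative)."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.mlp = nn.Linear(d, d)
        self.cross_attn_to_q = nn.Linear(d, d)

    def forward(self, x):
        return self.cross_attn_to_q(torch.relu(self.mlp(x)))

def localize_and_wrap(model: nn.Module, keyword: str = "cross_attn", rank: int = 4):
    """Freeze every parameter, then attach LoRA adapters only to Linear
    submodules whose names contain `keyword` (the 'localized' layers)."""
    for p in model.parameters():
        p.requires_grad_(False)
    targets = [
        (module, name, child)
        for module in model.modules()
        for name, child in module.named_children()
        if isinstance(child, nn.Linear) and keyword in name
    ]
    for module, name, child in targets:
        setattr(module, name, LoRALinear(child, rank=rank))
    return model

model = ToyBlock()
x = torch.randn(2, 64)
before = model(x)
model = localize_and_wrap(model)
after = model(x)  # unchanged: the LoRA update starts at zero

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

Only the adapter factors are trainable, so the number of updated parameters is a small fraction of the model, which is what makes this kind of targeted fine-tuning cheap.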

Why it matters?

This matters because it makes AI image generators more efficient and gives us more control over the text in AI-created images. It could lead to better, faster, and safer AI tools for creating images with text, which could be useful in fields like advertising, education, and social media. It also shows that we can make big improvements to AI by focusing on small, crucial parts instead of changing everything.

Abstract

Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of their generations. Then, we demonstrate how the localized layers can be used to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across diffusion model architectures, from U-Net-based models (e.g., LDM, SDXL, and DeepFloyd IF) to transformer-based ones (e.g., Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to large language models like T5). Project page available at https://t2i-text-loc.github.io/.
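The localization itself relies on activation patching: cache a layer's activations from one forward pass and splice them into another, then observe how the output changes. Here is a minimal sketch of that mechanic using PyTorch forward hooks; the tiny three-layer model is purely illustrative (in the paper, the cached and patched activations come from the diffusion model's attention layers).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a diffusion backbone.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))

cache = {}

def save_hook(module, inputs, output):
    # Record this layer's activation on the "clean" run.
    cache["act"] = output.detach()

def patch_hook(module, inputs, output):
    # Returning a tensor from a forward hook overrides the layer's output,
    # splicing the cached activation into the "corrupted" run.
    return cache["act"]

clean = torch.randn(1, 8)
corrupted = torch.randn(1, 8)

corrupted_out = model(corrupted)  # baseline, no patching

handle = model[0].register_forward_hook(save_hook)
clean_out = model(clean)
handle.remove()

handle = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupted)  # layer 0's output replaced by the cached one
handle.remove()
```

If patching a particular layer moves the output toward the clean run's output, that layer carries the information being traced; repeating this layer by layer is how one pinpoints which parameters influence the generated text.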