Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu
2024-06-18

Summary
This paper examines how large language models (LLMs) can be used as prompt encoders in text-to-image diffusion models. It identifies two obstacles that arise when LLMs are plugged in directly and proposes a new framework to overcome them.
What's the problem?
While LLMs are excellent at understanding text, using them directly as prompt encoders in image generation models actually degrades prompt-following ability. There are two reasons. First, LLMs are trained for next-token prediction, which does not match the discriminative, whole-prompt features that diffusion models need to create images. Second, decoder-only LLMs use causal attention, so each token can only attend to the words before it; this structure introduces a positional bias, where the same concept is encoded differently depending on where it appears in the prompt, leading to less accurate image generation.
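To make the causal-attention issue concrete, here is a minimal sketch, assuming a standard Hugging Face decoder-only model (GPT-2 is used purely as a small stand-in; the paper works with much larger LLMs), of how per-token features are typically extracted from such a model, with comments noting where the positional bias comes from.

```python
# Minimal sketch (not the authors' code): extracting per-token prompt
# features from a decoder-only LLM. GPT-2 stands in for a large LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "a red cube on top of a blue sphere"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Last-layer hidden states are a common choice of per-token prompt features.
features = out.hidden_states[-1]  # shape: (1, seq_len, hidden_dim)

# Under the causal mask, token i attends only to tokens 0..i, so the
# feature for "red" never sees "sphere". Reordering the prompt shifts
# which tokens carry context: the positional bias described above.
```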
What's the solution?
To address these issues, the authors propose a framework that optimizes how LLMs encode text prompts for diffusion models. They lay out usage guidance that strengthens the LLM's text representations and removes its positional bias, and the framework can also fuse multiple LLMs to further boost performance (one plausible fusion scheme is sketched below). Building on this framework, they design the LLM-Infused Diffusion Transformer (LI-DiT) and show in extensive experiments that it outperforms state-of-the-art open-source models as well as commercial systems such as Stable Diffusion 3, DALL-E 3, and Midjourney V6.
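As one illustration of what fusing multiple LLMs could look like, here is a hedged sketch that projects each encoder's token features to a shared width and concatenates them along the token axis, so a diffusion transformer can cross-attend to tokens from every encoder. The `PromptFeatureFuser` module, its dimensions, and the projection-and-concatenate scheme are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch: fusing prompt features from two LLMs of different widths.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class PromptFeatureFuser(nn.Module):
    def __init__(self, dims, fused_dim):
        super().__init__()
        # One linear projection per encoder, mapping into a shared width.
        self.projs = nn.ModuleList(nn.Linear(d, fused_dim) for d in dims)
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, feature_list):
        # feature_list[i]: (batch, seq_len_i, dims[i]) from the i-th LLM.
        projected = [p(f) for p, f in zip(self.projs, feature_list)]
        # Concatenate along the token axis so downstream cross-attention
        # sees tokens from every encoder.
        return self.norm(torch.cat(projected, dim=1))

fuser = PromptFeatureFuser(dims=[4096, 2048], fused_dim=1024)
feats = [torch.randn(2, 77, 4096), torch.randn(2, 64, 2048)]
print(fuser(feats).shape)  # torch.Size([2, 141, 1024])
```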
Why it matters?
This research is important because it helps bridge the gap between language understanding and image generation, making AI systems more powerful and flexible. By improving how prompts are processed, this work can lead to better quality images generated from text descriptions, enhancing applications in art, design, and content creation.
Abstract
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identified two main obstacles behind this issue. One is the misalignment between the next-token-prediction training of LLMs and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To deal with this issue, we propose a novel framework to fully harness the capabilities of LLMs. Through carefully designed usage guidance, we effectively enhance the text representation capability for prompt encoding and eliminate the inherent positional bias. This allows us to flexibly integrate state-of-the-art LLMs into the text-to-image generation model. Furthermore, we also provide an effective way to fuse multiple LLMs into our framework. Considering the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework. We conduct extensive experiments to validate LI-DiT across model sizes and data sizes. Benefiting from the inherent abilities of LLMs and our innovative designs, the prompt understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models, including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be available after further optimization and security checks.
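For intuition on how the positional bias the abstract describes might be counteracted, here is a minimal sketch, assuming a small bidirectional transformer is stacked on top of the frozen LLM's token features so that every token can attend to the whole prompt. This illustrative "refiner" and its hyperparameters are assumptions, not LI-DiT's actual module.

```python
# Minimal sketch: re-encoding causal (position-biased) LLM features with a
# small bidirectional transformer. An illustration, not LI-DiT's design.
import torch
import torch.nn as nn

refiner = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=2,
)

llm_features = torch.randn(2, 77, 1024)  # per-token features from a causal LLM
refined = refiner(llm_features)          # no causal mask: every token now
print(refined.shape)                     # attends to the entire prompt
```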