Training-free Regional Prompting for Diffusion Transformers
Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang
2024-11-05

Summary
This paper introduces Training-free Regional Prompting, a method that enhances the ability of diffusion transformers to generate images from complex text prompts. It focuses on improving how these models understand and render scenes containing multiple objects, each with its own attributes.
What's the problem?
While diffusion models have become good at generating images from text, they struggle with long, complicated prompts that describe multiple objects along with their attributes and spatial relationships. Regional prompting methods, which assign different prompts to different parts of an image, exist for UNet-based models such as SD1.5 and SDXL, but there has been no implementation for the newer Diffusion Transformer (DiT) architecture used by models like SD3 and FLUX.1.
What's the solution?
The authors propose a regional prompting framework for Diffusion Transformers that requires no additional training. It manipulates attention within the model so that each user-specified region of the image is guided by its own regional text prompt. By doing so, the model can generate detailed, coherent images that accurately reflect complex compositional descriptions. They implemented this method on the FLUX.1 model, showing that it effectively handles multi-regional prompts and produces high-quality images. A minimal code sketch of this regional attention idea follows below.
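The core mechanism is a mask applied to the joint text-image attention inside the DiT. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's exact implementation: the token layout (all regional text tokens first, then image tokens), the helper names (`build_regional_mask`, `regional_attention`), and the choice to let image tokens attend to all other image tokens are assumptions made for illustration.

```python
# Minimal sketch of regional attention masking for a DiT-style joint
# attention layer (assumed token layout: [text tokens | image tokens]).
# Not the paper's exact implementation.
import torch
import torch.nn.functional as F

def build_regional_mask(h, w, text_lens, boxes):
    """Build a boolean joint-attention mask; True means "may attend".

    h, w: image-token grid size.
    text_lens: number of text tokens for each regional prompt.
    boxes: (x0, y0, x1, y1) in [0, 1] for each region, aligned with text_lens.
    """
    n_txt = sum(text_lens)
    n = n_txt + h * w
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Text tokens of each prompt attend among themselves only.
    spans, offset = [], 0
    for t in text_lens:
        spans.append((offset, offset + t))
        mask[offset:offset + t, offset:offset + t] = True
        offset += t

    # Assumption: image tokens attend to all image tokens, which keeps
    # the composed regions globally coherent.
    mask[n_txt:, n_txt:] = True

    # Image tokens inside a region and that region's text tokens
    # attend to each other; all other text-image pairs stay masked.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    for (s, e), (x0, y0, x1, y1) in zip(spans, boxes):
        in_box = ((xs >= x0 * w) & (xs < x1 * w) &
                  (ys >= y0 * h) & (ys < y1 * h)).flatten()
        img_idx = n_txt + torch.nonzero(in_box, as_tuple=True)[0]
        mask[img_idx, s:e] = True
        mask[s:e, img_idx] = True
    return mask

def regional_attention(q, k, v, mask):
    # q, k, v: (batch, heads, n_tokens, head_dim); mask broadcasts over
    # batch and heads, so each query only sees its permitted keys.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Leaving image-to-image attention unrestricted is one plausible way to keep the regions blending into a single coherent scene; the actual repository may balance regional and global attention differently.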
Why it matters?
This research is significant because it addresses a major limitation in current image generation technology. By allowing models to generate images that are more aligned with complex textual descriptions, this method can improve applications in fields like gaming, virtual reality, and design, where accurate and detailed visuals are crucial.
Abstract
Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which equips DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.