IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, Bin Cui
2024-10-10

Summary
This paper introduces IterComp, a new framework that improves compositional text-to-image generation by aggregating composition-aware preferences from multiple advanced diffusion models and applying iterative feedback learning.
What's the problem?
Current text-to-image models, such as RPG, Stable Diffusion 3, and FLUX, have distinct strengths in compositional generation. Some are better at attribute binding (associating properties like color and shape with the correct objects), while others excel at handling spatial relationships (how objects are arranged in an image). Because no single model handles every aspect well, it is hard to generate high-quality compositional images that accurately reflect complex prompts.
What's the solution?
To solve this problem, IterComp aggregates composition-aware preferences from multiple models and applies an iterative feedback learning approach. The authors curate a gallery of six powerful open-source diffusion models and evaluate them on three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. From these evaluations, they build a composition-aware model preference dataset of image-rank pairs, which is used to train composition-aware reward models. IterComp then refines both the base image generation model and the reward models in a closed loop over multiple iterations.
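To make the reward-model step concrete, the sketch below shows how a composition-aware reward model could be trained on image-rank pairs with a pairwise (Bradley-Terry style) ranking loss in PyTorch. The class name, encoder interfaces, feature dimension, and data fields are illustrative assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionRewardModel(nn.Module):
    """Illustrative reward model: scores how well an image matches a prompt
    on one compositional axis (e.g., attribute binding).
    The encoders are assumed to be frozen pretrained backbones."""
    def __init__(self, image_encoder, text_encoder, feat_dim=768):
        super().__init__()
        self.image_encoder = image_encoder      # maps images -> (batch, feat_dim)
        self.text_encoder = text_encoder        # maps prompt tokens -> (batch, feat_dim)
        self.head = nn.Linear(2 * feat_dim, 1)  # joint features -> scalar reward

    def forward(self, images, prompt_tokens):
        img_feat = self.image_encoder(images)
        txt_feat = self.text_encoder(prompt_tokens)
        joint = torch.cat([img_feat, txt_feat], dim=-1)
        return self.head(joint).squeeze(-1)     # (batch,) reward scores

def ranking_loss(reward_model, prompt_tokens, preferred_images, rejected_images):
    """Pairwise ranking loss: the image ranked higher in the model-gallery
    preference data should receive the higher reward."""
    r_win = reward_model(preferred_images, prompt_tokens)
    r_lose = reward_model(rejected_images, prompt_tokens)
    return -F.logsigmoid(r_win - r_lose).mean()
```

Since the paper evaluates three compositional metrics, a natural design (assumed in this sketch) is to train one such reward model per metric, so each metric provides its own preference signal during feedback learning.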
Why it matters?
This research is significant because it enhances the ability of AI to generate high-quality images from text prompts by leveraging the best features of multiple models. By improving compositional generation, IterComp can lead to better results in fields like art creation and advertising, and in any other area where visual content is generated from descriptions. This could ultimately make AI-generated images more realistic and useful.
Abstract
Advanced diffusion models like RPG, Stable Diffusion 3, and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling in handling attribute binding and others in spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve the composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate them on three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations. Theoretical analysis demonstrates the effectiveness of our approach, and extensive experiments show its significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: https://github.com/YangLing0818/IterComp
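One rough way to formalize the closed loop described in the abstract is sketched below; the notation, the reward weights $\lambda_k$, and the exact construction of the expanded preference set are assumptions for illustration, not the paper's stated objective. At iteration $t$, the base diffusion model $G_\theta$ is fine-tuned to maximize the expected composition-aware rewards, and each reward model is then re-trained on a preference set expanded with samples from the refined model:

$$
\theta_{t+1} = \arg\max_{\theta}\; \mathbb{E}_{p \sim \mathcal{P},\, x \sim G_\theta(p)} \Big[ \textstyle\sum_{k} \lambda_k\, R_k^{(t)}(x, p) \Big],
$$
$$
\mathcal{L}\big(R_k^{(t+1)}\big) = -\,\mathbb{E}_{(p,\, x^{+},\, x^{-}) \sim \mathcal{D}_{t+1}} \Big[ \log \sigma\big( R_k^{(t+1)}(x^{+}, p) - R_k^{(t+1)}(x^{-}, p) \big) \Big],
$$

where $\mathcal{P}$ is the prompt set, $R_k$ are the composition-aware reward models (attribute binding, spatial, and non-spatial relationships), $x^{+}$ is ranked above $x^{-}$ in the expanded preference dataset $\mathcal{D}_{t+1}$, and $\sigma$ is the sigmoid. Alternating these two updates over several iterations is what yields the progressive self-refinement of both the base model and the reward models.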