
Interleaving Reasoning for Better Text-to-Image Generation

Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin

2025-09-09


Summary

This paper introduces a new method called Interleaving Reasoning Generation (IRG) to improve how well AI models create images from text descriptions, aiming to close the gap between current image generation models and more advanced systems like GPT-4o.

What's the problem?

While recent AI models are getting better at generating images from text, they still struggle with accurately following complex instructions and preserving all the details requested in the prompt. Existing models often generate images that are good overall, but miss important specifics or don't quite match what the user intended, unlike systems that carefully 'think' through the request before creating the image.

What's the solution?

The researchers developed IRG, which works by having the model first create a text-based 'thought process' to plan the image, then generate an initial image based on that plan. Next, the model 'reflects' on the image, identifying areas for improvement in detail, quality, and how well it matches the original text. It then refines the image based on this reflection. To train this system, they created a new dataset called IRGL-300K and a training process that focuses on both strong initial image creation and high-quality refinement through textual reflection.
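The sketch below illustrates this generate-reflect-refine loop in pseudocode. It is a minimal illustration of the idea described above, not the authors' released implementation: the model object and its methods (generate_thought, generate_image, reflect) are hypothetical names standing in for a unified multimodal model that emits interleaved text and image outputs.

```python
def interleaving_reasoning_generation(model, prompt, refinement_rounds=1):
    # Step 1: text-based "thought process" that plans the image before synthesis.
    thought = model.generate_thought(prompt)

    # Step 2: initial image conditioned on both the prompt and the plan.
    image = model.generate_image(prompt, thought)

    for _ in range(refinement_rounds):
        # Step 3: textual reflection on the current image, noting gaps in
        # detail, visual quality, and alignment with the original prompt.
        reflection = model.reflect(prompt, thought, image)

        # Step 4: refined image that implements the reflection while
        # preserving the semantics of the original plan.
        image = model.generate_image(prompt, thought, reflection=reflection)

    return image
```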

Why it matters?

This research is important because it significantly improves the quality and accuracy of AI-generated images. The new method achieves state-of-the-art results on several benchmarks, meaning it creates images that are more visually appealing and better aligned with the original text prompts, bringing AI closer to truly understanding and fulfilling user requests for image creation.

Abstract

Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation .
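For readers who want a concrete picture of the two-stage training the abstract describes, here is a hedged sketch under stated assumptions: the dataset selector, mode names, and train method are illustrative placeholders, not the interface of the released IRGL-300K code.

```python
def irgl_two_stage_training(model, irgl_300k):
    # Stage 1: build robust text-based thinking and reflection from the
    # decomposed learning modes of IRGL-300K (sub-goal 1: a strong initial
    # think-and-generate stage; sub-goal 2: high-quality textual reflection).
    thinking_data = irgl_300k.select(modes="thinking_and_reflection")
    model.train(thinking_data)

    # Stage 2: efficiently tune the full IRG pipeline on complete
    # thinking-image trajectories, so that textual refinements are
    # faithfully implemented in the subsequent image.
    trajectory_data = irgl_300k.select(modes="full_thinking_image_trajectories")
    model.train(trajectory_data)

    return model
```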