Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders
Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng
2026-01-16
Summary
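This paper introduces think-then-generate (T2G), a text-to-image approach in which the language model that encodes the prompt first reasons about and rewrites it before the diffusion model renders the image. The aim is images that are more factually consistent and better aligned with the prompt's intent than literal text-to-pixel generation.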
What's the problem?
Current text-to-image models, even advanced ones, mostly just translate text into pixels without really *understanding* what the text means or what should be shown in the image. They treat the text as a simple instruction, not something to be reasoned about, leading to images that might be visually appealing but don't always make sense or accurately reflect the prompt's intent.
What's the solution?
The researchers propose a 'think-then-generate' method. The language model first analyzes and rewrites the original text prompt to clarify what the image should contain, and the rewritten prompt then guides image creation. Training has two stages. A lightweight supervised fine-tuning step teaches the language model to reason about and rewrite prompts. The language model and the image generator are then optimized together with a reinforcement learning scheme called Dual-GRPO: the language model is rewarded, based on the generated images, for inferring and recalling the right world knowledge, while the image generator is pushed to render the rewritten prompt's meaning faithfully and coherently.
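The two-step inference flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `T2GPipeline`, its three callables, and the toy stand-ins are all hypothetical names introduced here, and a real system would replace them with an LLM encoder and a diffusion backbone.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class T2GPipeline:
    """Hypothetical think-then-generate wrapper (names are illustrative only)."""
    think_and_rewrite: Callable[[str], str]   # LLM reasons about the raw prompt and rewrites it
    encode: Callable[[str], List[float]]      # encoder states of the rewritten prompt
    denoise: Callable[[List[float]], str]     # diffusion backbone conditioned on those states

    def generate(self, raw_prompt: str) -> str:
        rewritten = self.think_and_rewrite(raw_prompt)  # step 1: think, then rewrite
        conditioning = self.encode(rewritten)           # step 2: states of the rewritten prompt
        return self.denoise(conditioning)               # step 3: render from the conditioning


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_rewrite(prompt: str) -> str:
    # A lookup table plays the role of world knowledge recalled by the LLM.
    knowledge = {"water at -10 degrees Celsius": "a block of solid ice"}
    return knowledge.get(prompt, prompt)


def toy_encode(text: str) -> List[float]:
    # One fake "state" per character.
    return [float(ord(ch)) for ch in text]


def toy_denoise(conditioning: List[float]) -> str:
    return f"<image conditioned on {len(conditioning)} states>"


pipeline = T2GPipeline(toy_rewrite, toy_encode, toy_denoise)
print(pipeline.generate("water at -10 degrees Celsius"))
```

The point of the structure is that the image generator never sees the raw prompt directly. It is conditioned only on the encoding of the rewritten prompt, so any reasoning the language model does is what shapes the image.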
Why does it matter?
This work is a step toward AI models that *reason* about what to depict rather than only rendering it. That allows more complex and accurate image generation, moving beyond literal interpretations of text toward images that reflect what the user actually intends. The authors present it as progress toward unified models that combine reasoning, expression, and demonstration.
Abstract
Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
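The abstract does not spell out how Dual-GRPO is computed. As background only, the sketch below shows the group-relative advantage normalization that GRPO-style methods generally use: several samples are drawn for the same prompt, each receives a scalar reward, and each reward is standardized against its group. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made here. How the paper shares or splits rewards between the text encoder and the diffusion backbone is not covered.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples drawn for the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: three images sampled for one prompt, scored by a hypothetical reward model.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
print(advantages)
```

Samples that score above the group mean get positive advantages and are reinforced, and those below get negative ones, so no separate value network is needed to provide a baseline.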