Unified Thinker: A General Reasoning Modular Core for Image Generation
Sashuai Zhou, Qiang Zhou, Jijin Hu, Hanqing Yang, Yue Cao, Junpeng Ma, Yinchao Ma, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng, Zhou Zhao
2026-01-07
Summary
This paper focuses on improving how well AI image generators can follow complex instructions, specifically those requiring logical thinking. Current open-source models aren't as good at this as some closed-source systems, and the researchers believe the issue isn't just about making the image creation itself better, but about improving the AI's ability to *plan* how to create the image.
What's the problem?
Existing AI image generators struggle when given instructions that require reasoning or multiple steps. They can create visually appealing images, but often fail to accurately represent the logical relationships described in the instructions. This creates a gap between understanding what's asked and actually producing the correct image, and open-source models are falling behind the capabilities of some privately developed systems.
What's the solution?
The researchers developed a system called Unified Thinker. It separates the 'thinking' part of image generation from the actual image creation. The 'Thinker' plans out the steps needed to fulfill the instruction, breaking it down into smaller, verifiable actions. A 'Generator' then creates the image from that plan. They trained the system in two stages: first teaching it *how* to plan, then using a reward signal to refine those plans based on how well the resulting image matches the instruction. Because the two parts are decoupled, the planning process can be improved without retraining the entire image generator.
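The decoupled design described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual implementation: the class names (`Thinker`, `Generator`, `PlanStep`) and the `pixel_reward` function are assumptions invented for this sketch, with the real models stubbed out.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Thinker/Generator decoupling: the Thinker
# turns a high-level instruction into a structured plan of small,
# verifiable steps; the Generator executes that plan. A scalar reward on
# the final image can then refine the Thinker's planning policy without
# touching the Generator.

@dataclass
class PlanStep:
    action: str   # e.g. "place", "edit", "verify"
    target: str   # object or region the step refers to
    detail: str   # grounded description the Generator can execute

class Thinker:
    """Decomposes an instruction into verifiable plan steps (stubbed)."""
    def plan(self, instruction: str) -> list[PlanStep]:
        # A real Thinker would be a reasoning model; here we fake a
        # fixed two-step decomposition for illustration.
        return [
            PlanStep("place", "subject", f"render main subject of: {instruction}"),
            PlanStep("verify", "layout", "check spatial relations match the instruction"),
        ]

class Generator:
    """Consumes a plan and produces an image (stubbed as a string)."""
    def generate(self, steps: list[PlanStep]) -> str:
        return " -> ".join(f"{s.action}({s.target})" for s in steps)

def pixel_reward(image: str, instruction: str) -> float:
    # Stand-in for pixel-level feedback: score how well the rendered
    # image matches the instruction (stubbed to a keyword check).
    return 1.0 if "place(subject)" in image else 0.0

thinker, generator = Thinker(), Generator()
steps = thinker.plan("a red cube on top of a blue sphere")
image = generator.generate(steps)
score = pixel_reward(image, "a red cube on top of a blue sphere")
```

In the paper's two-stage paradigm, the first stage would train `Thinker.plan` to emit well-formed structured plans, and the second stage would use a reward like `score` as reinforcement-learning feedback on the planning policy, while the Generator's weights stay frozen.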
Why it matters?
This work is important because it addresses a fundamental limitation of current AI image generators. By focusing on reasoning and planning, it moves beyond simply generating visually plausible images and towards creating images that accurately reflect complex ideas and instructions. This could lead to AI systems that are much more useful for tasks requiring precise visual representations, like design, illustration, or even scientific visualization.
Abstract
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning-execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.