Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Fuli Feng
2025-09-09
Summary
This paper introduces a new way to test how well artificial intelligence models can create images from text descriptions, focusing on two key abilities: putting things together in a scene as described (composition) and figuring out things that aren't directly stated but are implied (reasoning).
What's the problem?
Current benchmarks for text-to-image models can't keep up with how capable these models have become. Existing tests use simple prompts: the scenes they describe are sparse, and understanding them requires only basic, one-step inference. They don't challenge models to handle dense, complex scenes or to make the kinds of sophisticated inferences humans do. With prompts that simple, it's hard to tell whether a model is *actually* reasoning or just getting lucky.
What's the solution?
The researchers created a new, much more challenging benchmark called T2I-CoReBench. It tests both composition and reasoning in a fine-grained way. For composition, they check how well models handle individual objects (instances), their qualities (attributes), and how they relate to each other (relations). For reasoning, they cover three classic types of inference: deriving specific facts from general rules (deduction), generalizing from examples (induction), and finding the best explanation for an observation (abduction). The prompts are deliberately complex, packed with details, and each one comes with a checklist of yes/no questions that verify whether the model got each part of the image right. In total, the benchmark contains 1,080 prompts and about 13,500 checklist questions.
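To make the checklist idea concrete, here is a minimal sketch of how per-prompt scoring could work. This is illustrative, not the authors' actual evaluation code: the names (`BenchmarkPrompt`, `score_image`) are hypothetical, and the judge that answers each yes/no question about a generated image (in practice, typically a vision-language model) is represented here only by a list of boolean answers.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkPrompt:
    prompt: str           # the full text-to-image prompt
    dimension: str        # one of the taxonomy dimensions, e.g. "relation"
    checklist: list[str]  # independent yes/no questions, one per intended element

def score_image(answers: list[bool]) -> float:
    """Score one generated image as the fraction of checklist questions
    judged 'yes'. Each question is assessed independently, which is what
    makes the evaluation fine-grained."""
    return sum(answers) / len(answers) if answers else 0.0

# Hypothetical example: a composition prompt with three checklist items,
# where the judge answered yes, yes, no for a particular generated image.
item = BenchmarkPrompt(
    prompt="A red cube resting on a blue sphere, beside a green cone",
    dimension="relation",
    checklist=[
        "Is there a red cube?",
        "Is the cube resting on a blue sphere?",
        "Is there a green cone beside them?",
    ],
)
score = score_image([True, True, False])
print(round(score, 3))  # 0.667
```

Scoring each element independently, rather than asking a single "does the image match?" question, is what lets the benchmark pinpoint *which* part of a dense prompt a model dropped.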
Why it matters?
This new benchmark matters because it offers a more realistic and thorough way to evaluate text-to-image models. Across 27 current models, the results show that while models handle simple scenes reasonably well, they still struggle with dense, complex ones, and they struggle even more with inferring what a prompt *implies* rather than states. This exposes a major weakness in current AI and points to where future research on truly intelligent image generation needs to focus.
Abstract
Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. However, with the emerging advances of T2I models in reasoning beyond composition, existing benchmarks reveal clear limitations in providing comprehensive evaluations across and within these capabilities. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent complexities of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist of individual yes/no questions that assess each intended element independently, facilitating fine-grained and reliable evaluation. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that their composition capability still remains limited in complex high-density scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Our project page: https://t2i-corebench.github.io/.