GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen
2025-10-14
Summary
This paper introduces a new way to test how well artificial intelligence models can both understand and create images based on reasoning, going beyond simply recognizing objects in pictures.
What's the problem?
Current AI models that handle both images and text are getting better, but there isn't a good, standardized test to see if they *really* understand what they're doing when they connect images to ideas and generate new images. It's hard to tell if a model is just memorizing patterns or actually reasoning about the visual world and applying logic. Existing tests often rely on AI judging other AI, which can be biased.
What's the solution?
The researchers created a benchmark called GIR-Bench, which tests models in three complementary ways. First (GIR-Bench-UGC), it checks whether a model applies the same knowledge when understanding an image as when creating an image from text. Second (GIR-Bench-T2I), it tests whether the model can generate images that satisfy logical constraints or implicit knowledge hidden in the prompt. Finally (GIR-Bench-Edit), it assesses whether the model can perform image edits that require multiple steps of reasoning. Each subset comes with its own task-specific evaluation pipeline, designed to avoid the biases that arise when one AI model is used to judge another.
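To make the first idea concrete, here is a minimal sketch of what an understanding–generation consistency check could look like. Everything below is an assumption for illustration, not GIR-Bench's actual pipeline: the `Item` fields, the stubbed model/detector outputs, and the agreement-based `consistency` score are all hypothetical.

```python
# Hypothetical sketch of an understanding-generation consistency (UGC) check.
# Idea: probe the same piece of knowledge twice - once as a question
# (understanding) and once as a generation task - then measure agreement.

from dataclasses import dataclass

@dataclass
class Item:
    question: str   # understanding probe, e.g. "Which animal is larger, X or Y?"
    prompt: str     # generation probe that requires the same fact
    expected: str   # ground-truth answer for the understanding probe

def understand_ok(model_answer: str, item: Item) -> bool:
    """Check the understanding-side answer against ground truth."""
    return model_answer.strip().lower() == item.expected.lower()

def generate_ok(detected_label: str, item: Item) -> bool:
    """Check whether the correct entity appears in the generated image.
    (Here the detector output is stubbed as a plain label string.)"""
    return detected_label.strip().lower() == item.expected.lower()

def consistency(results: list[tuple[bool, bool]]) -> float:
    """Fraction of items where understanding and generation agree
    (both correct or both wrong) - one simple way to quantify the gap."""
    agree = sum(1 for u, g in results if u == g)
    return agree / len(results)

# Toy run with stubbed model outputs:
item = Item("Which is larger, an elephant or a cat?",
            "Draw the larger of an elephant and a cat.",
            "elephant")
results = [(understand_ok("elephant", item), generate_ok("cat", item))]
print(consistency(results))  # 0.0: the model knew the fact but failed to apply it
```

The interesting failure mode this kind of score surfaces is exactly the one the paper highlights: a model can answer the question correctly yet fail to use the same fact when generating, which shows up here as disagreement between the two checks.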
Why it matters?
This work is important because it provides a more reliable way to measure the true intelligence of these AI models. By focusing on reasoning, it helps researchers identify weaknesses and improve the ability of AI to not just 'see' things, but to actually understand and interact with the visual world in a meaningful way, leading to more advanced and trustworthy AI systems.
Abstract
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design different task-specific evaluation pipelines tailored for each task. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.