UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
2025-10-22
Summary
This paper introduces a new way to test how well AI image generators understand and follow text instructions, with a focus on making the evaluation more thorough and realistic.
What's the problem?
Current methods for evaluating text-to-image AI fall short in two ways: they don't cover a wide enough range of request types or languages, and they only check whether an image roughly matches its prompt rather than verifying that every detail is correct. As a result, they can't pinpoint *which* specific parts of a prompt a model understands or misses.
What's the solution?
The researchers created a benchmark called UniGenBench++, which contains 600 prompts organized hierarchically across 5 main themes and 20 subthemes covering a wide range of everyday situations. Each prompt comes in both English and Chinese, in both short and long versions, and is designed to probe the AI on 10 primary criteria and 27 finer sub-criteria of understanding. The team used a powerful AI model, Gemini-2.5-Pro, to help build the benchmark and judge generated images, and they also trained a separate evaluation model that can score a generator's outputs offline, making the benchmark easier for others to use.
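To make the hierarchical design concrete, here is a minimal sketch of what a single benchmark entry might look like. The field names and structure are illustrative assumptions for this summary, not the paper's actual data schema; only the counts (5 themes, 20 subthemes, 10 primary and 27 sub criteria, bilingual short/long prompts, multiple testpoints per prompt) come from the paper.

```python
# Hypothetical sketch of one UniGenBench++ entry. Field names and layout
# are assumptions for illustration, not the paper's released format.
from dataclasses import dataclass, field


@dataclass
class Testpoint:
    primary_criterion: str  # one of the 10 primary evaluation criteria
    sub_criterion: str      # one of the 27 sub-criteria
    description: str        # what the evaluator should verify in the image


@dataclass
class BenchmarkEntry:
    theme: str                                       # one of 5 main prompt themes
    subtheme: str                                    # one of 20 subthemes
    prompts: dict = field(default_factory=dict)      # keyed by (language, length)
    testpoints: list = field(default_factory=list)   # multiple testpoints per prompt


entry = BenchmarkEntry(
    theme="daily life",
    subtheme="cooking",
    prompts={
        ("en", "short"): "A chef flips three golden pancakes in a red pan.",
        ("en", "long"): "In a sunlit kitchen, a chef in a white apron flips "
                        "three golden pancakes in a red cast-iron pan...",
    },
    testpoints=[
        Testpoint("counting", "object count", "exactly three pancakes"),
        Testpoint("attribute binding", "color", "the pan is red"),
        Testpoint("action", "human action", "the chef is flipping pancakes"),
    ],
)
```

Packing several testpoints into each prompt is what lets the benchmark stay compact (600 prompts) while still covering all 27 sub-criteria.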
Why it matters?
This work is important because it provides a much more reliable and detailed way to assess the capabilities of text-to-image AI. By identifying the strengths and weaknesses of different AI models, it helps developers improve them and supports the effective use of these tools in real-world applications.
Abstract
Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks for evaluating how accurately generated images reflect the semantics of their textual prompts. However, (1) existing benchmarks lack diversity in prompt scenarios and multilingual support, both essential for real-world applicability; and (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions and falling short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) it spans diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; and (2) it comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt, in short and long forms. Leveraging the general world knowledge and fine-grained image understanding of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, we develop an effective pipeline for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.
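The abstract describes a pipeline in which each generated image is checked against the testpoints attached to its prompt. Below is a minimal sketch of how such per-testpoint scoring could work, assuming a yes/no judgment per testpoint; `query_mllm` is a hypothetical placeholder for whatever evaluator is plugged in (e.g., Gemini-2.5-Pro via its API, or the paper's offline evaluation model), and the YES/NO protocol and averaging are assumptions, not the paper's exact scoring rule.

```python
# Minimal sketch of per-testpoint semantic-consistency scoring.
# `query_mllm` is a hypothetical callable standing in for an MLLM judge;
# the prompting format and pass/fail aggregation are illustrative assumptions.
from typing import Callable


def score_image(image_path: str,
                testpoints: list[str],
                query_mllm: Callable[[str, str], str]) -> float:
    """Return the fraction of testpoints the image satisfies (0.0 to 1.0)."""
    passed = 0
    for tp in testpoints:
        question = ("Does the image satisfy the following requirement? "
                    f"Answer YES or NO.\nRequirement: {tp}")
        answer = query_mllm(image_path, question)
        if answer.strip().upper().startswith("YES"):
            passed += 1
    return passed / len(testpoints) if testpoints else 0.0


# Example with a stub judge that always answers YES:
score = score_image("pancakes.png",
                    ["exactly three pancakes", "the pan is red"],
                    query_mllm=lambda img, q: "YES")
print(score)  # 1.0
```

Averaging these per-testpoint verdicts over all prompts, languages, and prompt lengths is one natural way to arrive at the per-criterion strengths and weaknesses the paper reports for each T2I model.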