WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, Li Yuan

2025-03-11

Summary

This paper introduces WISE, a benchmark that tests whether AI image generators can produce pictures consistent with real-world knowledge, such as historical facts or science concepts, rather than pictures that merely look real.

What's the problem?

Existing evaluations of AI image tools mostly check whether an image looks real or matches a basic text description. They do not test whether the image is consistent with knowledge of history, geography, or science.

What's the solution?

WISE uses 1,000 carefully crafted prompts across 25 subdomains (like 'a medieval knight using a smartphone') and a new metric, WiScore, that grades each image on accuracy, realism, and visual appeal, using AI judges to spot mistakes.
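To make the idea of a combined score concrete, here is a minimal sketch of how three per-image judge ratings could be folded into one number. The weights, the 0 to 2 rating scale, and the names `JudgeScores` and `composite_score` are illustrative assumptions; this summary does not state how WiScore is actually computed, so treat the block as one plausible shape rather than the paper's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScores:
    """Ratings an AI judge might assign to one generated image (hypothetical)."""
    accuracy: float    # does the image agree with the knowledge behind the prompt?
    realism: float     # does the image look plausible?
    aesthetics: float  # does the image look good?


# ASSUMPTION: illustrative weights that favor accuracy; not taken from this summary.
ASSUMED_WEIGHTS = {"accuracy": 0.7, "realism": 0.2, "aesthetics": 0.1}
# ASSUMPTION: each rating lies on a 0..2 scale.
ASSUMED_MAX_RATING = 2.0


def composite_score(scores: JudgeScores,
                    weights: dict = ASSUMED_WEIGHTS,
                    max_rating: float = ASSUMED_MAX_RATING) -> float:
    """Weighted average of the ratings, normalized to the range 0..1."""
    for name in weights:
        value = getattr(scores, name)
        if not 0.0 <= value <= max_rating:
            raise ValueError(f"{name} rating {value} is outside 0..{max_rating}")
    weighted_sum = sum(w * getattr(scores, name) for name, w in weights.items())
    # Divide by the largest possible weighted sum so the result is in 0..1.
    return weighted_sum / (max_rating * sum(weights.values()))


def benchmark_score(all_scores: list) -> float:
    """Mean composite score over every rated image in a benchmark run."""
    if not all_scores:
        raise ValueError("no scores to average")
    return sum(composite_score(s) for s in all_scores) / len(all_scores)
```

Under these assumed weights, an image that is beautiful and realistic but factually wrong scores far lower than one that is accurate but plain, which is the behavior a knowledge-focused metric is meant to have.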

Why does it matter?

This helps improve AI tools so they can create images that are not just pretty but also factually correct, which is crucial for education, news, or any field needing accurate visuals.

Abstract

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, and lack a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) on these prompts reveals significant limitations in their ability to integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
