ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang
2026-03-31
Summary
This paper introduces ImagenWorld, a new way to test how well AI models can create and edit images. It's a comprehensive benchmark designed to stress-test these image-generating systems on open-ended, real-world tasks.
What's the problem?
Currently, the ways we test image generation models fall short. Existing tests often focus on just one thing, like turning text into images, or only cover certain types of images. They also don't tell us *why* a model fails, just that it *does* fail. This makes it hard to improve these models effectively.
What's the solution?
The researchers created ImagenWorld, a large collection of over 3,600 image requests covering six core tasks (creating images from scratch or editing existing ones, with one or more reference images) and six topical domains (like realistic photos, artwork, or screenshots). They then had people carefully review the images produced by 14 different AI models, yielding 20,000 fine-grained annotations that mark specific errors. They also used automated tools to score the images, and importantly, they combined the human feedback with the automated scores to get a more complete picture.
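To make this concrete, here is a minimal Python sketch of what one evaluation record could look like, pairing a human rating and an automated score with fine-grained error tags. The field names and the equal-weight blending are illustrative assumptions, not the paper's actual data schema or aggregation rule.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorTag:
    """One localized error flagged by an annotator (illustrative fields)."""
    level: str        # "object" or "segment"
    region: tuple     # bounding box (x0, y0, x1, y1) of the flagged area
    description: str  # annotator's note, e.g. "garbled chart label"

@dataclass
class EvalRecord:
    """One model output on one benchmark condition set (hypothetical schema)."""
    task: str                  # e.g. "text-to-image", "single-reference-edit"
    domain: str                # e.g. "screenshots", "artworks"
    model: str                 # one of the 14 evaluated models
    human_score: float         # aggregated human rating in [0, 1]
    vlm_score: float           # automated VLM-based metric in [0, 1]
    errors: list[ErrorTag] = field(default_factory=list)

    def combined_score(self, w_human: float = 0.5) -> float:
        # Blend human and automated scores; the weighting is an assumption.
        return w_human * self.human_score + (1 - w_human) * self.vlm_score
```

A record like this keeps the *why* of a failure (the error tags) alongside the overall score, which is the diagnostic angle the benchmark emphasizes.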
Why does it matter?
ImagenWorld is important because it provides a much more thorough and detailed way to evaluate image generation models. The results show that models are generally better at creating images than editing them, and they struggle with images containing a lot of text or symbols. This benchmark will help researchers understand the weaknesses of current models and develop better ones in the future, ultimately leading to more reliable and useful image generation technology.
Abstract
Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) Models typically struggle more in editing tasks than in generation tasks, especially in local edits. (2) Models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics. (3) Closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases. (4) Modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human rankings, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
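The reported Kendall accuracy measures how often an automated metric orders a pair of outputs the same way human raters do. Below is a minimal sketch, assuming the common pairwise-agreement reading of "Kendall accuracy"; the paper's exact protocol and tie handling may differ.

```python
from itertools import combinations

def kendall_accuracy(human_scores, metric_scores):
    """Fraction of item pairs ordered the same way by both score lists.

    Ties are skipped for simplicity; this is an assumption, not
    necessarily the paper's convention.
    """
    concordant, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        h = human_scores[i] - human_scores[j]
        m = metric_scores[i] - metric_scores[j]
        if h == 0 or m == 0:
            continue
        total += 1
        if (h > 0) == (m > 0):  # both orderings agree on this pair
            concordant += 1
    return concordant / total if total else 0.0

# Toy example: human vs. VLM scores for five generated images.
print(kendall_accuracy([0.9, 0.4, 0.7, 0.2, 0.6],
                       [0.8, 0.5, 0.9, 0.1, 0.6]))  # -> 0.9
```

An accuracy of 0.79 under this definition would mean the VLM-based metric agrees with the human ordering on roughly four out of every five comparable pairs.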