Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation
Isabela Albuquerque, Ira Ktena, Olivia Wiles, Ivana Kajić, Amal Rannen-Triki, Cristina Vasconcelos, Aida Nematzadeh
2025-11-14
Summary
This paper tackles a key limitation of text-to-image models: although they keep getting better at creating images from text descriptions, they often produce very similar-looking results, lacking variety.
What's the problem?
Current text-to-image models struggle to generate diverse images. While they can produce images that match what you ask for, their outputs tend to be too similar to each other, even when a prompt leaves room for many different interpretations. Until now, there was no good way to actually *measure* how diverse these models are, or to pinpoint where they fail to create variety.
What's the solution?
The researchers built a framework to evaluate how diverse these models are. They did this by focusing on specific concepts (like 'apple') and the relevant ways those concepts can vary, called factors of variation (like the color of the apple). They designed a human evaluation template for people to judge the diversity of generated images, curated a set of prompts covering different concepts, and used binomial tests on the human annotations to compare models against each other (a sketch of such a comparison appears below). They also compared different image embeddings, mathematical representations of images, to see which were best at capturing diversity.
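A minimal sketch of the model-comparison step, assuming each human annotation is a pairwise judgment of which model's images are more diverse for a given prompt. The prompt-entry structure, win counts, and significance threshold below are illustrative assumptions, not details taken from the paper:

```python
# Sketch: comparing two T2I models via a binomial test over pairwise
# human diversity judgments. All counts are hypothetical placeholders.
from scipy.stats import binomtest

# Example entry from the curated prompt set (the 'apple'/'color' pair
# comes from the paper's abstract; the dict layout is illustrative).
prompt_entry = {
    "prompt": "An image of an apple",
    "concept": "apple",
    "factor_of_variation": "color",
}

# Hypothetical tally: annotators compared image sets from model A and
# model B for 200 prompts, picking the more diverse set (ties excluded).
wins_a = 127          # prompts where model A was judged more diverse
n_comparisons = 200   # total pairwise judgments

# Null hypothesis: neither model is more diverse, so each judgment
# favors model A with probability 0.5.
result = binomtest(wins_a, n_comparisons, p=0.5, alternative="two-sided")

print(f"Model A preferred on {wins_a}/{n_comparisons} prompts "
      f"(p = {result.pvalue:.4f})")
if result.pvalue < 0.05:  # illustrative significance threshold
    print("Difference in perceived diversity is statistically significant.")
```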
Why it matters?
This work is important because it provides a reliable method for measuring diversity in text-to-image models. This allows researchers to identify which models are best at creating varied images and where improvements are needed. Ultimately, this can lead to image generation systems that produce more creative and varied results.
Abstract
Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
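To make the embedding comparison concrete, here is a minimal sketch of one common automated diversity proxy: the mean pairwise cosine distance between embeddings of images generated for the same prompt. The paper compares embeddings but does not prescribe this exact formula, so treat the function and the random placeholder embeddings as illustrative assumptions:

```python
# Sketch of an embedding-based diversity score: mean pairwise cosine
# distance between embeddings of images generated for one prompt.
# This specific formula is an illustrative assumption, not necessarily
# the metric used in the paper.
import numpy as np

def pairwise_cosine_diversity(embeddings: np.ndarray) -> float:
    """Mean cosine distance over all distinct pairs of image embeddings.

    embeddings: array of shape (num_images, embedding_dim), e.g.
    features from a pretrained vision encoder.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Average similarity over the strictly upper triangle (distinct pairs),
    # then convert similarity to distance.
    iu = np.triu_indices(len(embeddings), k=1)
    return float(1.0 - sims[iu].mean())

# Hypothetical usage: 8 generated images with 512-dim embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(8, 512))
print(f"Diversity score: {pairwise_cosine_diversity(fake_embeddings):.3f}")
```

Higher scores indicate that the generated images are spread farther apart in embedding space; the paper's contribution includes testing which embedding spaces make such scores agree with human diversity judgments.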