Do I look like a `cat.n.01` to you? A Taxonomy Image Generation Benchmark

Viktor Moskvoretskii, Alina Lobanova, Ekaterina Neminova, Chris Biemann, Alexander Panchenko, Irina Nikishina

2025-03-14

Summary

This paper explores how well text-to-image AI models can generate pictures that match specific categories or concepts, like the entries in a structured lexical database such as WordNet (for example, `cat.n.01`).

What's the problem?

While AI can create images from text, it's unclear how well these models understand and visualize structured knowledge, such as the hierarchical relationships between categories of things. We don't know whether they can accurately depict a concept like 'cat' as it is defined within a taxonomy, rather than just as an everyday word.

What's the solution?

The researchers created a benchmark to evaluate how well text-to-image models generate images for taxonomy concepts. The test set combines common-sense concepts, randomly sampled WordNet concepts, and concepts predicted by large language models, and 12 models are scored with 9 taxonomy-related metrics plus human feedback. They also pioneer a pairwise evaluation method in which another AI (GPT-4) judges which of two generated images better fits a concept (see the sketch below).
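
As a rough illustration of the setup (not the authors' released code), the sketch below assembles taxonomy prompts from WordNet synsets using NLTK. The prompt template and helper names are assumptions made for illustration only.

```python
# Minimal sketch: turning WordNet synsets into text-to-image prompts.
# Requires: pip install nltk; then nltk.download("wordnet") once.
import random

from nltk.corpus import wordnet as wn

def sample_noun_synsets(n: int, seed: int = 0) -> list:
    """Randomly sample n noun synsets (e.g. cat.n.01) from WordNet."""
    nouns = list(wn.all_synsets(pos="n"))
    random.seed(seed)
    return random.sample(nouns, n)

def synset_to_prompt(synset) -> str:
    """Build a prompt from the synset's first lemma and its gloss.
    The template here is an illustrative assumption, not the paper's."""
    lemma = synset.lemma_names()[0].replace("_", " ")
    return f"A photo of a {lemma}. {synset.definition()}"

for s in sample_noun_synsets(3):
    print(s.name(), "->", synset_to_prompt(s))
```

Each prompt would then be fed to a text-to-image model in a zero-shot setup; the same synsets anchor the evaluation metrics.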

Why it matters?

This work matters because it helps us understand the limitations of AI in visualizing knowledge and opens the door for automating the creation of visual databases or educational resources.

Abstract

This paper explores the feasibility of using text-to-image models in a zero-shot setup to generate images for taxonomy concepts. While text-based methods for taxonomy enrichment are well-established, the potential of the visual dimension remains unexplored. To address this, we propose a comprehensive benchmark for Taxonomy Image Generation that assesses models' abilities to understand taxonomy concepts and generate relevant, high-quality images. The benchmark includes common-sense and randomly sampled WordNet concepts, alongside LLM-generated predictions. The 12 models are evaluated using 9 novel taxonomy-related text-to-image metrics and human feedback. Moreover, we pioneer the use of pairwise evaluation with GPT-4 feedback for image generation. Experimental results show that the ranking of models differs significantly from standard T2I tasks. Playground-v2 and FLUX consistently outperform other models across metrics and subsets, while the retrieval-based approach performs poorly. These findings highlight the potential for automating the curation of structured data resources.
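
For readers curious about the pairwise GPT-4 evaluation the abstract mentions, here is a hedged Python sketch using the OpenAI chat API with image inputs. The judging prompt, the `gpt-4o` model name, and the function names are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of pairwise image judging with a GPT-4-class vision model.
# Requires: pip install openai; OPENAI_API_KEY set in the environment.
import base64

from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Base64-encode a local image so it can be sent inline to the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def judge_pair(concept: str, definition: str, image_a: str, image_b: str) -> str:
    """Ask the model which of two images better depicts a taxonomy concept.
    The question wording below is an assumption, not the paper's template."""
    question = (
        f"Concept: {concept}\nDefinition: {definition}\n"
        "Which image better depicts this concept? Answer 'A' or 'B' only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_a)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_b)}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```

Aggregating such A/B verdicts over many concept pairs yields a model ranking, which is the general idea behind the pairwise GPT-4 feedback the paper introduces.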