IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
2025-01-24
Summary
This paper introduces IMAGINE-E, a new benchmark for testing and comparing AI models that create images from text descriptions. It's like creating a standardized exam for these AI artists to see how good they are at different types of drawing tasks.
What's the problem?
AI models that turn text into images (called T2I models) are getting really good at making all sorts of pictures, and they can even do things like edit images, generate videos, and work with 3D scenes. But we don't have a good way to test all these new abilities. It's like having a bunch of super-talented artists but no fair way to judge their work across different types of art.
What's the solution?
The researchers created IMAGINE-E, which is like a big art contest for AI. They tested six of the best T2I models on five types of tasks: generating structured outputs, making images that look realistic and physically consistent, creating images for specific fields like medicine or architecture, handling especially tricky requests, and drawing in many different styles.
Why it matters?
This matters because as AI gets better at creating images, we need to know which models are best for different jobs. IMAGINE-E helps us understand what each AI is good at, which could help developers improve them. It's also important for people who might use these AIs, like artists or designers, to know which tool best fits their needs. And by showing that some models are getting really good at specific tasks, this study suggests we may be moving toward AI that can handle all sorts of image-related jobs, not just making pretty pictures.
Abstract
With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram 2.0, along with others like DALL-E 3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising the question of whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess model performance across these expanding domains. To evaluate these models thoroughly, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram 2.0, Midjourney, DALL-E 3, Stable Diffusion 3, and Jimeng. Our evaluation covers five key domains: structured output generation; realism and physical consistency; specific-domain generation; challenging scenario generation; and multi-style creation. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram 2.0 on structured and specific-domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.
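To make the benchmark's structure concrete, here is a minimal, hypothetical sketch of how an evaluation matrix like IMAGINE-E's could be organized: each of the six models is scored on prompts from each of the five domains, and scores are averaged per (model, domain) pair. The model list and domain names come from the abstract above; everything else (the example prompts, `generate_image`, `score_image`, and the 0-1 scoring scale) is an illustrative assumption, not the authors' released evaluation scripts (see the GitHub link above for those).

```python
"""Hypothetical sketch of an IMAGINE-E-style evaluation matrix.

The six models and five domains are taken from the paper's abstract;
generate_image and score_image are illustrative placeholders, NOT the
authors' released scripts.
"""
from statistics import mean

MODELS = ["FLUX.1", "Ideogram 2.0", "Midjourney", "DALL-E 3",
          "Stable Diffusion 3", "Jimeng"]

# One made-up example prompt per domain; a real suite would use many.
DOMAIN_PROMPTS = {
    "structured_output": ["a signpost with the exact text 'IMAGINE-E'"],
    "realism_physical_consistency": ["a glass of water tipping over mid-spill"],
    "specific_domain": ["an anatomical diagram of the human heart"],
    "challenging_scenario": ["five stacked cubes, each a different color"],
    "multi_style": ["a city street in ukiyo-e woodblock style"],
}

def generate_image(model: str, prompt: str) -> bytes:
    """Placeholder: call the model's API and return image bytes."""
    return f"{model}:{prompt}".encode()

def score_image(image: bytes, prompt: str) -> float:
    """Placeholder: human or automated rating on a 0-1 scale."""
    return 0.5  # a real harness would plug in raters or metrics here

def evaluate() -> dict[str, dict[str, float]]:
    """Return the average score for every (model, domain) pair."""
    return {
        model: {
            domain: mean(score_image(generate_image(model, p), p)
                         for p in prompts)
            for domain, prompts in DOMAIN_PROMPTS.items()
        }
        for model in MODELS
    }

if __name__ == "__main__":
    for model, scores in evaluate().items():
        print(model, scores)
```

The matrix shape is the point of the sketch: comparing per-domain averages, rather than one overall score, is what lets a benchmark like this surface findings such as FLUX.1 and Ideogram 2.0 standing out specifically on structured and specific-domain tasks.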