Constantly Improving Image Models Need Constantly Improving Benchmarks

Jiaxin Ge, Grace Luo, Heekyung Lee, Nishant Malpani, Long Lian, XuDong Wang, Aleksander Holynski, Trevor Darrell, Sewon Min, David M. Chan

2025-10-21

Summary

This paper introduces a new way to evaluate image generation models, such as the one powering GPT-4o's image creation, by focusing on how people are *actually* using these models and what they are asking them to do.

What's the problem?

Currently, the tests used to evaluate image generators don't keep up with how quickly the technology is improving and the creative ways people are using it. Existing tests often miss the complex and new things users are prompting the models to create, so they don't give a complete picture of a model's abilities. This creates a disconnect between what people think a model can do and what the official tests say it can do.

What's the solution?

The researchers created a framework called ECHO that builds benchmarks from real social media posts showing what people ask image generators to do. They collected over 31,000 prompts from such posts, focusing specifically on GPT-4o Image Gen. Analyzing these prompts surfaced tasks that older benchmarks missed, like translating product labels into different languages or generating realistic receipts with specified totals. They also used the qualitative feedback in these posts to design better quality metrics, measuring how much an output image shifts in color, identity, and structure relative to what the prompt asked for.
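To make the metric idea concrete, here is a minimal sketch of one kind of measurement such community feedback might motivate: quantifying the color shift between a source image and a model's edited output. The function names and the specific statistics are illustrative assumptions, not the paper's actual metrics.

```python
# Hypothetical color-shift metrics (illustrative only, not from the ECHO paper).
import numpy as np

def mean_color_shift(src: np.ndarray, out: np.ndarray) -> np.ndarray:
    """Per-channel difference in mean RGB value (positive = output is brighter
    in that channel). Inputs are H x W x 3 arrays of pixel values in [0, 255]."""
    return out.reshape(-1, 3).mean(axis=0) - src.reshape(-1, 3).mean(axis=0)

def histogram_distance(src: np.ndarray, out: np.ndarray, bins: int = 32) -> float:
    """Average L1 distance between normalized per-channel color histograms;
    0 means identical color distributions."""
    dists = []
    for c in range(3):
        h1, _ = np.histogram(src[..., c], bins=bins, range=(0, 255), density=True)
        h2, _ = np.histogram(out[..., c], bins=bins, range=(0, 255), density=True)
        dists.append(np.abs(h1 - h2).sum())
    return float(np.mean(dists))

# Toy example: a mid-gray image vs. a slightly warmer (red-shifted) copy.
rng = np.random.default_rng(0)
src = rng.integers(100, 156, size=(64, 64, 3)).astype(np.float64)
out = src.copy()
out[..., 0] = np.clip(out[..., 0] + 20, 0, 255)  # push the red channel up

shift = mean_color_shift(src, out)
print(shift)  # red channel shifted by about +20, green/blue unchanged
```

An "identity" or "structure" shift metric would follow the same pattern but compare embeddings from a face-recognition or feature-extraction model instead of raw pixel statistics.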

Why it matters?

This work is important because it provides a more realistic and up-to-date way to evaluate image generation models. By using real-world examples, ECHO can better identify the strengths and weaknesses of these models and guide future improvements, ensuring they align with what users actually want and need. It also helps to understand how people are creatively using these tools.

Abstract

Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at https://echo-bench.github.io.