Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
Zicheng Zhang, Tengchuan Kou, Shushi Wang, Chunyi Li, Wei Sun, Wei Wang, Xiaoyu Li, Zongyu Wang, Xuezhi Cao, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai
2025-03-05
Summary
This paper introduces Q-Eval-100K, a new dataset and evaluation method for measuring how good AI-generated images and videos look and how well they match the text descriptions used to create them.
What's the problem?
Current ways of checking AI-generated images and videos aren't always accurate because they don't have enough human-rated examples to learn from. This makes it hard to tell whether an AI is really creating good, relevant content.
What's the solution?
The researchers built a huge dataset called Q-Eval-100K, with 100,000 AI-generated images and videos rated by humans for visual quality and for how well they match their text descriptions. They used it to train Q-Eval-Score, a new AI model that can judge both how good an image or video looks and how well it fits its description, even for long, detailed prompts.
Why it matters?
This matters because as AI gets better at creating images and videos, we need reliable ways to check whether the results are actually high quality and match what people asked for. Q-Eval-100K could help make AI-generated content more reliable and useful, which is important as these technologies become more common in our daily lives.
Abstract
Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models heavily relies on the scale and quality of human annotations. According to Scaling Law, increasing the number of human-labeled instances follows a predictable pattern that enhances the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for the mentioned two aspects. The Q-EVAL-100K dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with context prompt, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset. Data and codes will be available at https://github.com/zzc-1998/Q-Eval.
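The abstract's human-labeled scores are Mean Opinion Scores (MOS): each instance's ratings from multiple annotators are averaged into one number. As a minimal sketch of that idea (the 1-to-5 scale and the out-of-range filter here are illustrative assumptions, not details from the paper):

```python
def mean_opinion_score(ratings, scale=(1, 5)):
    """Average per-annotator scores into a single MOS.

    ratings: raw scores from individual annotators on `scale`.
    Returns the arithmetic mean, the standard MOS definition.
    """
    lo, hi = scale
    valid = [r for r in ratings if lo <= r <= hi]  # drop out-of-range entries
    if not valid:
        raise ValueError("no valid ratings")
    return sum(valid) / len(valid)

# Example: five annotators rate one generated image's visual quality.
print(mean_opinion_score([4, 5, 3, 4, 4]))  # 4.0
```

In Q-Eval-100K, each of the 100K instances carries such scores along two separate axes (visual quality and text alignment), which is why 100K instances yield roughly 960K individual annotations.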