
M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment

Chuan Cui, Kejiang Chen, Zhihua Wei, Wen Shen, Weiming Zhang, Nenghai Yu

2025-02-25


Summary

This paper introduces M3-AGIQA, a new way to judge the quality of AI-generated images that considers multiple aspects and uses multimodal AI models to make its assessments.

What's the problem?

As AI gets better at creating images, it's becoming harder to judge how good those images really are. A fair evaluation has to consider several things at once: how realistic an image looks, how well it matches what was asked for, and whether it seems authentic. Current methods for judging AI-generated images don't cover all of these aspects well enough.

What's the solution?

The researchers created M3-AGIQA, which uses powerful AI models that understand both text and images. It examines each image over multiple rounds, describing in detail what it sees; these intermediate descriptions help it judge the image's quality, how well it matches the original request, and how authentic it seems. The system is also trained so that its scores match human opinions about image quality.
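To make the scoring step concrete, here is a minimal, hypothetical sketch of how per-round model outputs could be turned into a Mean Opinion Score (MOS). The real system feeds sequential logits through an xLSTM and a regression head; in this toy version, a simple mean-pool over softmax distributions and an expected-value readout stand in for that learned predictor, and all the logit values are invented for illustration.

```python
import math

# Five hypothetical rating levels, e.g. "bad" .. "excellent", mapped to scores 1-5.
LEVELS = (1.0, 2.0, 3.0, 4.0, 5.0)

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict_mos(round_logits):
    """Pool per-round logits into a single MOS estimate.

    Stand-in for the paper's xLSTM + regression head: average the
    softmax distributions across rounds, then take the expected
    rating level under the pooled distribution.
    """
    pooled = [0.0] * len(LEVELS)
    for logits in round_logits:
        probs = softmax(logits)
        pooled = [p + q for p, q in zip(pooled, probs)]
    pooled = [p / len(round_logits) for p in pooled]
    return sum(level * p for level, p in zip(LEVELS, pooled))

# One logit vector per evaluation round (quality, correspondence, authenticity);
# the numbers here are made up.
rounds = [
    [0.1, 0.2, 1.5, 2.0, 0.3],   # quality round
    [0.0, 0.4, 1.0, 2.5, 1.2],   # correspondence round
    [0.2, 0.1, 2.2, 1.1, 0.0],   # authenticity round
]
mos = predict_mos(rounds)
```

The expected-value readout keeps the prediction inside the valid rating range by construction, which is one reason distribution-based score heads are popular in quality assessment; the actual M3-AGIQA predictor is learned end-to-end against human MOS labels.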

Why it matters?

This matters because as AI-generated images become more common in areas like art, advertising, and social media, we need reliable ways to tell which ones are high quality and which aren't. M3-AGIQA could help ensure that only the best AI images are used, making AI-generated imagery more reliable and useful in real-world applications.

Abstract

The rapid advancement of AI-generated image (AGI) models has introduced significant challenges in evaluating their quality, which requires considering multiple dimensions such as perceptual quality, prompt correspondence, and authenticity. To address these challenges, we propose M3-AGIQA, a comprehensive framework for AGI quality assessment that is Multimodal, Multi-Round, and Multi-Aspect. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) as joint text and image encoders and distills advanced captioning capabilities from online MLLMs into a local model via Low-Rank Adaptation (LoRA) fine-tuning. The framework includes a structured multi-round evaluation mechanism, where intermediate image descriptions are generated to provide deeper insights into the quality, correspondence, and authenticity aspects. To align predictions with human perceptual judgments, a predictor constructed by an xLSTM and a regression head is incorporated to process sequential logits and predict Mean Opinion Scores (MOSs). Extensive experiments conducted on multiple benchmark datasets demonstrate that M3-AGIQA achieves state-of-the-art performance, effectively capturing nuanced aspects of AGI quality. Furthermore, cross-dataset validation confirms its strong generalizability. The code is available at https://github.com/strawhatboy/M3-AGIQA.