MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
2024-07-09

Summary
This paper introduces MJ-Bench, a new benchmark designed to evaluate how well multimodal reward models judge the quality of images generated from text descriptions. It focuses on ensuring these models provide accurate and safe feedback.
What's the problem?
The main problem is that text-to-image models, like DALLE-3 and Stable Diffusion, often produce poor-quality images or unsafe content due to issues like hallucination (making things up) and bias. To improve these models, they are typically aligned using feedback from a multimodal judge that evaluates their outputs. However, these judges are themselves rarely evaluated thoroughly, which can lead to misalignment and unsafe outcomes when the models are fine-tuned.
What's the solution?
To address this issue, the authors introduce MJ-Bench, which includes a detailed preference dataset that evaluates multimodal judges across four key areas: alignment (how well the image matches the text), safety (ensuring the content is appropriate), image quality, and bias (avoiding unfair or harmful representations). They tested a wide variety of judges, including smaller CLIP-based scoring models and both open-source and closed-source vision-language models (VLMs). The results showed that closed-source models like GPT-4o generally provided the best feedback. They also found that the smaller scoring models were better at assessing alignment and image quality, while the larger VLMs excelled at evaluating safety and bias thanks to their stronger reasoning capabilities.
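To make the judging setup concrete, here is a minimal sketch of how a CLIP-based scoring model could judge a preference pair: score the prompt against two candidate images and prefer the higher-scoring one. It uses the Hugging Face transformers CLIP API; the checkpoint name and the simple tie-free decision rule are illustrative assumptions, not MJ-Bench's exact configuration.

```python
# Minimal sketch of a CLIP-based judge scoring a preference pair.
# Assumes the Hugging Face `transformers` CLIP API; the checkpoint and
# decision rule are illustrative, not MJ-Bench's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image: Image.Image) -> float:
    """Image-text alignment score from CLIP (higher = better match)."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        return model(**inputs).logits_per_image.item()

def judge_pair(prompt: str, image_a: Image.Image, image_b: Image.Image) -> str:
    """Return which of the two generated images the judge prefers."""
    return "A" if clip_score(prompt, image_a) > clip_score(prompt, image_b) else "B"

# Usage:
# preferred = judge_pair("a red cube on a blue sphere",
#                        Image.open("gen_a.png"), Image.open("gen_b.png"))
```

A judge's accuracy on a benchmark like this is then simply how often its preferred image matches the human-preferred one in the dataset.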
Why it matters?
This research is important because it helps improve the way AI systems generate images from text by providing a better framework for evaluating their outputs. By ensuring that these multimodal judges can accurately assess images, MJ-Bench can lead to safer and higher-quality image generation in applications ranging from art creation to advertising, ultimately benefiting users by providing more reliable AI-generated content.
Abstract
While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark that incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges, including smaller-sized CLIP-based scoring models, open-source VLMs (e.g., the LLaVA family), and closed-source VLMs (e.g., GPT-4o, Claude 3), on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming other judges on average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Further studies on feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language (Likert-scale) than on numerical scales. Notably, human evaluations of end-to-end fine-tuned models using separate feedback from these multimodal judges yield similar conclusions, further confirming the effectiveness of MJ-Bench. All data, code, and models are available at https://huggingface.co/MJ-Bench.
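As a hedged sketch of the feedback-scale comparison mentioned above, the snippet below shows one way a VLM judge's natural-language (Likert-scale) answer could be elicited and mapped to a number for pairwise comparison. The `query_vlm` helper is a hypothetical placeholder for a VLM API call (e.g., to GPT-4o), and the scale labels and prompt wording are illustrative, not the paper's exact protocol.

```python
# Hedged sketch: eliciting Likert-scale feedback from a VLM judge and
# mapping it to a number. `query_vlm` is a hypothetical placeholder for
# an API call, not a real client; the labels are illustrative only.

LIKERT = {"extremely poor": 1, "poor": 2, "average": 3,
          "good": 4, "outstanding": 5}

def query_vlm(image_path: str, prompt: str) -> str:
    # Placeholder: wire up your own VLM client here.
    raise NotImplementedError

def likert_alignment_score(image_path: str, caption: str) -> int:
    prompt = (
        f"How well does this image match the caption '{caption}'? "
        f"Answer with exactly one of: {', '.join(LIKERT)}."
    )
    reply = query_vlm(image_path, prompt).lower()
    # Match longer labels first so "extremely poor" is not read as "poor".
    for label in sorted(LIKERT, key=len, reverse=True):
        if label in reply:
            return LIKERT[label]
    raise ValueError(f"unparseable judge reply: {reply!r}")
```

Under this scheme, the judge's preference on a pair is whichever image receives the higher mapped score, which can then be compared against the human label just as with the numeric scoring models.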