MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen
2024-06-24

Summary
This paper introduces MantisScore, a new system designed to automatically evaluate the quality of generated videos by simulating human feedback.
What's the problem?
As video generation technology has improved, there has been a growing need for reliable ways to measure the quality of generated videos. However, existing automatic metrics are not very effective, mainly because there is no large-scale dataset of generated videos that humans have reviewed and scored. Without that data, it is hard to build automatic metrics that truly reflect how good a video is.
What's the solution?
To tackle this issue, the authors released VideoFeedback, a large dataset containing human-provided multi-aspect scores for over 37,000 synthesized videos created by 11 different video generation models. They then developed MantisScore, which is trained on this dataset to automatically assess video quality. Their experiments showed that MantisScore achieves a Spearman correlation of 77.1 with human evaluations on the VideoFeedback test set, meaning it can effectively mimic how humans rate video quality. This beats the previous best metrics by roughly 50 points, making MantisScore a useful tool for evaluating video generation models.
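As a concrete illustration of the evaluation protocol, the sketch below computes a Spearman correlation between automatic metric scores and human ratings using SciPy. The score values are made-up placeholders for illustration, not numbers from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical scores for a handful of generated videos:
# automatic metric outputs vs. averaged human ratings (placeholder values).
metric_scores = [3.2, 1.8, 2.5, 3.9, 1.1, 2.9]
human_scores = [3.0, 2.0, 2.2, 4.0, 1.0, 3.1]

# Spearman's rho measures how well the metric preserves the human ranking,
# regardless of the absolute scale of either set of scores.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation: {rho:.3f} (p={p_value:.3g})")
```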
Why it matters?
This research is important because it provides a way to automatically assess video quality in a manner that closely aligns with human judgment. By improving how we evaluate generated videos, MantisScore can help developers track the progress of their models and enhance the quality of AI-generated content. This is crucial in fields like entertainment, education, and marketing, where high-quality video content is essential.
Abstract
Recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metrics is able to provide reliable scores over generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores over 37.6K synthesized videos from 11 existing video generative models. We train MantisScore (initialized from Mantis) on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between MantisScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on the held-out benchmarks EvalCrafter, GenAI-Bench, and VBench show that MantisScore consistently has much higher correlation with human judges than other metrics. Given these results, we believe MantisScore can serve as a great proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning from Human Feedback (RLHF) to improve current video generation models.
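To illustrate the "track progress" use case from the abstract, here is a minimal, hypothetical sketch of ranking video generators by their average metric score. The `score_video` function is a placeholder standing in for a learned video-quality metric such as MantisScore, and the model names and file paths are invented for illustration.

```python
from statistics import mean


def score_video(prompt: str, video_path: str) -> float:
    """Hypothetical wrapper around a learned video-quality metric.

    In practice this would run the metric model on the video, conditioned on
    the prompt, and return a scalar quality score; here it is only a stub.
    """
    return 0.0  # placeholder value


# Hypothetical outputs of two video generators for the same prompts.
generations = {
    "model_A": [("a cat surfing", "a_cat.mp4"), ("sunset timelapse", "a_sun.mp4")],
    "model_B": [("a cat surfing", "b_cat.mp4"), ("sunset timelapse", "b_sun.mp4")],
}

# Average the metric over each model's outputs and rank models by that average.
leaderboard = sorted(
    ((name, mean(score_video(p, v) for p, v in vids)) for name, vids in generations.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, avg_score in leaderboard:
    print(f"{name}: {avg_score:.2f}")
```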