K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

Zhikai Li, Xuewen Liu, Dongrong Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, Zhen Dong

2024-08-27

Summary

This paper introduces K-Sort Arena, a new platform designed to evaluate generative models more efficiently by using K-wise human preferences instead of traditional pairwise comparisons.

What's the problem?

Evaluating generative models, like those that create images or videos, is slow and inefficient with existing methods. Traditional arena-style approaches require a large number of pairwise comparisons before the ranking converges, and they are vulnerable to noise in user votes, which makes the results less reliable.

What's the solution?

K-Sort Arena improves this process by letting K models compete at once (K-wise comparisons) rather than just two at a time. Because people can judge images and videos more intuitively than text, a single K-wise vote is fast to cast yet carries much richer ranking information than a pairwise vote. The platform also uses probabilistic modeling and Bayesian updating to make the rankings robust to noisy votes. In experiments, K-Sort Arena converged 16.3 times faster than the widely used Elo rating algorithm.
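The idea of updating probabilistic skill estimates from a K-wise result can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it models each model's skill as a Gaussian (mu, sigma), decomposes one K-wise ranking into its implied pairwise outcomes, and applies a simple gradient-style Bayesian-flavored update; the constants and the logistic win-probability approximation are assumptions.

```python
import math

BETA = 4.0  # assumed performance-noise scale (illustrative constant)

def update_from_ranking(skills, ranking):
    """skills: dict name -> (mu, sigma); ranking: list of names, best first.

    A K-wise ranking implies K*(K-1)/2 pairwise outcomes, which is why
    one K-wise vote is more informative than one pairwise vote.
    """
    new = dict(skills)
    for i, winner in enumerate(ranking):
        for loser in ranking[i + 1:]:
            mu_w, sig_w = new[winner]
            mu_l, sig_l = new[loser]
            c = math.sqrt(sig_w**2 + sig_l**2 + 2 * BETA**2)
            # predicted probability that the winner beats the loser
            p = 1.0 / (1.0 + math.exp((mu_l - mu_w) / c))
            # move each mean by the "surprise" (1 - p), scaled by its variance
            mu_w += (sig_w**2 / c) * (1.0 - p)
            mu_l -= (sig_l**2 / c) * (1.0 - p)
            # shrink uncertainty slightly after each observed comparison
            new[winner] = (mu_w, max(sig_w * 0.99, 1.0))
            new[loser] = (mu_l, max(sig_l * 0.99, 1.0))
    return new

# One 4-wise battle updates all six implied pairs at once.
skills = {m: (25.0, 8.0) for m in ["A", "B", "C", "D"]}
skills = update_from_ranking(skills, ["B", "A", "D", "C"])
```

After this single vote, B's estimated mean rises above the others and C's falls, while every model's uncertainty shrinks a little; a pairwise system would need several separate votes to gather the same information.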

Why it matters?

This research is important because it provides a more efficient way to benchmark generative models, which can help developers quickly identify the best models for creating high-quality content. By improving the evaluation process, K-Sort Arena can support faster advancements in AI technologies related to image and video generation.

Abstract

The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. The Arena platform, which gathers user votes on model comparisons, can rank models with human preferences. However, traditional Arena methods, while established, require an excessive number of comparisons for ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena exhibits 16.3x faster convergence compared to the widely used Elo algorithm. To further validate its superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
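The exploration-exploitation matchmaking idea mentioned in the abstract can be illustrated with a simple heuristic. This sketch is an assumption, not the paper's exact strategy: it explores by anchoring the battle on the model whose rating is most uncertain, then exploits by filling the remaining K-1 slots with the models whose estimated skills are closest to the anchor, since close matches tend to be the most informative.

```python
def make_match(skills, k=4):
    """skills: dict name -> (mu, sigma); returns k model names to battle.

    Explore: start from the model with the largest sigma (least-known rating).
    Exploit: add the k-1 models with skill estimates closest to the anchor.
    """
    # exploration step: most uncertain model
    anchor = max(skills, key=lambda m: skills[m][1])
    mu_a = skills[anchor][0]
    # exploitation step: opponents with the closest estimated skill
    others = sorted(
        (m for m in skills if m != anchor),
        key=lambda m: abs(skills[m][0] - mu_a),
    )
    return [anchor] + others[: k - 1]

# hypothetical skill table: (mean, uncertainty)
skills = {"A": (24.0, 2.0), "B": (30.0, 8.0), "C": (29.0, 3.0),
          "D": (10.0, 1.0), "E": (31.0, 2.5)}
match = make_match(skills, k=3)  # B is the anchor (largest sigma)
```

Here model D, whose rating is both confident and far from the contenders, is skipped, so votes are spent where they reduce the most uncertainty.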