
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh

2024-10-18


Summary

This paper presents MixEval-X, a new benchmark that improves how AI models are evaluated by supporting any combination of input and output types, so that evaluations better reflect real-world conditions.

What's the problem?

Current evaluation methods for AI models use inconsistent standards because different communities follow different protocols at different levels of maturity. These evaluations also suffer from biases in how queries are chosen, how answers are graded, and how well results generalize, so they don't accurately measure how models perform in real-world situations. Together, this inconsistency and bias can lead to unreliable assessments of AI capabilities.

What's the solution?

To address these issues, the authors created MixEval-X, which standardizes evaluations across different input and output modalities. They developed pipelines that mix existing benchmarks and then adapt and rectify them so the resulting tasks match real-world task distributions. This keeps the evaluations relevant and lets them generalize to actual use cases. Extensive meta-evaluations show that MixEval-X aligns closely with real-world task distributions, and its model rankings correlate with crowd-sourced real-world evaluations at up to 0.98.
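To make the idea of a benchmark mixture concrete, here is a minimal Python sketch of drawing evaluation queries from several source benchmarks in proportion to an estimated real-world task distribution. The benchmark names, queries, and weights below are invented for illustration; this is a simplified stand-in, not the paper's actual mixture or adaptation-rectification pipeline.

```python
import random

# Hypothetical source benchmarks, each a pool of query strings.
# Names and contents are illustrative only.
source_benchmarks = {
    "image_qa":   ["What is shown in the image?", "How many people are visible?"],
    "captioning": ["Describe the scene in one sentence."],
    "audio_qa":   ["What instrument is playing in this clip?"],
}

# Assumed share of each task type in real-world usage (made-up numbers).
real_world_distribution = {
    "image_qa":   0.5,
    "captioning": 0.3,
    "audio_qa":   0.2,
}

def mix_benchmark(n_samples: int, seed: int = 0) -> list:
    """Sample queries from the source pools so the mix of task types
    roughly matches the target real-world distribution."""
    rng = random.Random(seed)
    tasks = list(real_world_distribution)
    weights = [real_world_distribution[t] for t in tasks]
    mixed = []
    for _ in range(n_samples):
        task = rng.choices(tasks, weights=weights, k=1)[0]
        mixed.append((task, rng.choice(source_benchmarks[task])))
    return mixed

if __name__ == "__main__":
    for task, query in mix_benchmark(6):
        print(f"[{task}] {query}")
```

In the actual paper, the mixed samples are further adapted and rectified so they read like real user tasks; the sketch above only shows the weighted-sampling idea.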

Why it matters?

This research is important because it provides a more reliable way to evaluate AI models, which is crucial for their development and deployment in real-world applications. By improving evaluation methods, MixEval-X can help ensure that AI systems are better understood and more effective in diverse environments, ultimately leading to advancements in technology that benefit society.

Abstract

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any real-world benchmark designed to optimize and standardize evaluations across input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions, and the model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.
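As an illustration of the kind of meta-evaluation the abstract refers to, the sketch below computes a Spearman rank correlation between hypothetical benchmark scores and crowd-sourced ratings for the same set of models. The model names and numbers are made up, and the choice of Spearman correlation here is an assumption for illustration, not the paper's exact evaluation code.

```python
from scipy.stats import spearmanr

# Hypothetical benchmark scores and crowd-sourced (e.g. arena-style) ratings
# for the same models; all values are invented for this example.
benchmark_scores = {"model_a": 71.2, "model_b": 65.4, "model_c": 80.1, "model_d": 58.9}
crowd_ratings    = {"model_a": 1180, "model_b": 1120, "model_c": 1245, "model_d": 1060}

models = sorted(benchmark_scores)
corr, p_value = spearmanr(
    [benchmark_scores[m] for m in models],
    [crowd_ratings[m] for m in models],
)
print(f"Spearman rank correlation: {corr:.2f} (p={p_value:.3f})")
```

A correlation near 1.0, as in the paper's reported 0.98, means the benchmark ranks models in nearly the same order as crowd-sourced real-world evaluations.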