
Redundancy Principles for MLLMs Benchmarks

Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai

2025-01-27


Summary

This paper talks about how to make tests for advanced AI models called Multi-modality Large Language Models (MLLMs) more efficient by reducing unnecessary repetition in these tests.

What's the problem?

As MLLMs are developing quickly, people are creating hundreds of new tests (benchmarks) each year to evaluate them. However, many of these tests end up measuring the same things over and over again. This redundancy wastes time and resources, and might not give us a clear picture of what these AI models can really do.

What's the solution?

The researchers looked at three main ways tests can be repetitive: overlapping in the skills they measure, containing more questions than are needed to tell models apart, and different tests in the same field asking about the same things. They analyzed how hundreds of MLLMs performed on more than 20 different tests, checking how closely scores on one test predict scores on another, to measure exactly how much repetition there is. Based on this, they came up with guidelines for making better, more efficient tests in the future.
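To make the cross-test comparison concrete, here is a minimal sketch, not the authors' exact method, of one way such redundancy can be estimated: if two benchmarks rank the same set of models in nearly the same order, their rank correlation is high and one of them adds little new information. The benchmark names and score matrix below are made up purely for illustration.

```python
# Minimal sketch: estimating cross-benchmark redundancy from model scores.
# The score matrix below is made-up illustrative data, not results from the paper.
import numpy as np
from scipy.stats import spearmanr

# Rows = models, columns = benchmarks (hypothetical scores on a 0-100 scale).
benchmarks = ["BenchA", "BenchB", "BenchC"]
scores = np.array([
    [72.1, 70.4, 55.2],   # model 1
    [65.3, 66.0, 61.8],   # model 2
    [80.9, 78.5, 49.7],   # model 3
    [58.4, 60.2, 70.1],   # model 4
    [75.0, 73.3, 52.6],   # model 5
])

# Pairwise Spearman rank correlation: values near 1 mean the two benchmarks
# order the models almost identically, i.e. they are largely redundant.
for i in range(len(benchmarks)):
    for j in range(i + 1, len(benchmarks)):
        rho, _ = spearmanr(scores[:, i], scores[:, j])
        print(f"{benchmarks[i]} vs {benchmarks[j]}: Spearman rho = {rho:.2f}")
```

In this toy data, BenchA and BenchB would come out highly correlated, suggesting one of the two could be dropped with little loss of information, while BenchC measures something different.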

Why it matters?

This matters because as AI gets more advanced, we need good ways to measure what it can do. By making tests more efficient and less repetitive, we can save time and resources while getting a clearer picture of AI capabilities. This helps researchers focus on improving AI in areas that really matter, rather than just getting better at passing repetitive tests. It also helps companies and users understand what different AI models are truly capable of doing.

Abstract

With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. This rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back, critically assess the current state of redundancy, and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.
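To illustrate the second perspective, redundancy in the number of test questions, here is a small synthetic sketch, again an assumption-laden illustration rather than the paper's procedure: if a randomly chosen subset of questions ranks models almost the same way as the full benchmark does, then many of the questions are redundant. All data below is generated, not taken from any real benchmark.

```python
# Minimal sketch: probing redundancy in the number of test questions.
# Idea: if a small random subset of questions ranks models almost the same
# way as the full benchmark, many questions are redundant. Data is synthetic.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_models, n_questions = 50, 1000

# Hypothetical per-question correctness (True = answered correctly) for each
# model, with models drawn to have different underlying ability levels.
ability = rng.uniform(0.3, 0.9, size=n_models)
correct = rng.random((n_models, n_questions)) < ability[:, None]

full_scores = correct.mean(axis=1)              # accuracy on the full benchmark

for subset_size in (25, 100, 400):
    idx = rng.choice(n_questions, size=subset_size, replace=False)
    sub_scores = correct[:, idx].mean(axis=1)   # accuracy on the subsampled questions
    rho, _ = spearmanr(full_scores, sub_scores)
    print(f"{subset_size:4d} questions: rank correlation with full benchmark = {rho:.3f}")
```

If a few hundred questions already reproduce the full-benchmark ranking almost perfectly, the remaining questions mostly add evaluation cost rather than new information, which is the kind of trade-off the paper's principles are meant to expose.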