TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang

2024-10-24

Summary

This paper presents TP-Eval, a new evaluation framework designed to improve how we assess multimodal large language models (MLLMs) by customizing prompts for different models.

What's the problem?

Evaluating MLLMs is tricky because small changes in the prompts (the questions or instructions given to the models) can cause large swings in measured performance. Since each model responds differently to the same wording, using a single shared prompt for every model can underestimate some of them and bias comparisons, so benchmark scores may not accurately reflect their abilities.
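To make the prompt-sensitivity issue concrete, here is a minimal sketch (not from the paper) of how one might measure it: score the same model on the same items under several prompt phrasings and compare per-prompt accuracy. The prompt wordings and the `query_model` hook are hypothetical stand-ins for whatever benchmark and MLLM inference call are actually used.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt phrasings of the same underlying question format.
PROMPT_VARIANTS = [
    "Answer the question about the image with a single word.",
    "Look at the image and answer briefly.",
    "Based on the picture, what is the answer? Reply concisely.",
]

def prompt_sensitivity(
    query_model: Callable[[str, str], str],   # (prompt, image_path) -> model answer
    items: List[Tuple[str, str]],             # (image_path, gold_answer) pairs
) -> Dict[str, float]:
    """Return accuracy under each prompt variant for one model."""
    scores = {}
    for prompt in PROMPT_VARIANTS:
        correct = 0
        for image_path, gold in items:
            answer = query_model(prompt, image_path)
            correct += int(answer.strip().lower() == gold.strip().lower())
        scores[prompt] = correct / max(len(items), 1)
    return scores

# A large gap between max(scores.values()) and min(scores.values()) means the
# reported benchmark score depends heavily on prompt wording, not just ability.
```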

What's the solution?

TP-Eval addresses this by customizing prompts for each model: it rewrites the original benchmark prompts into tailored versions suited to each MLLM's preferences. The framework includes prompt-customization modules designed specifically for the MLLM evaluation setting, which reduce evaluation bias and better reveal each model's true capabilities, as illustrated in the sketch below.
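The following is an illustrative sketch of the general idea, not TP-Eval's exact modules: propose rewrites of the original prompt and keep whichever variant the model under evaluation scores best with on a small held-out set. The `propose_rewrites` and `evaluate` hooks are assumptions (for example, an LLM-based rewriter and an accuracy function for the target model).

```python
from typing import Callable, List

def customize_prompt(
    original_prompt: str,
    propose_rewrites: Callable[[str], List[str]],  # hypothetical rewriter, e.g. an LLM
    evaluate: Callable[[str], float],              # target model's accuracy under a prompt
    rounds: int = 3,
) -> str:
    """Greedy search over prompt rewrites for a single model."""
    best_prompt, best_score = original_prompt, evaluate(original_prompt)
    for _ in range(rounds):
        for candidate in propose_rewrites(best_prompt):
            score = evaluate(candidate)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt
```

Because the search is run separately for each model, every model is evaluated with a prompt it can actually follow, which is the core intuition behind per-model prompt customization.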

Why it matters?

This research is important because it helps create fairer and more accurate evaluations of MLLMs. By improving how we assess these models, researchers can better understand their strengths and weaknesses, leading to more effective development of AI technologies that can handle complex tasks across various domains.

Abstract

Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical to analyzing attributes of MLLMs and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity - minor prompt variations may lead to significant performance fluctuations. Thus, inappropriate prompts may obscure the models' capabilities, underestimating the models' performance. Moreover, different models have different preferences for different prompts, and thus, using the same prompt for all models will cause evaluation bias. This paper analyzes this deficiency in existing benchmarks and further introduces a new evaluation framework named TP-Eval, which introduces a prompt customization method to reduce evaluation biases and tap models' potential. TP-Eval will rewrite the original prompts to different customized prompts for different models. In particular, we propose some well-designed modules for prompt customization tailored to the scenario of MLLM evaluation. Extensive experiments demonstrate the effectiveness of our approach to uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.