BenTo: Benchmark Task Reduction with In-Context Transferability

Hongyu Zhao, Ming Li, Lichao Sun, Tianyi Zhou

2024-10-18

Summary

This paper introduces BenTo, a new method for reducing the number of tasks used to evaluate large language models (LLMs) while still maintaining the quality of the evaluation.

What's the problem?

Evaluating large language models is expensive and time-consuming because it requires generating and scoring model outputs across a wide range of benchmark tasks. Many of these tasks overlap or measure similar abilities, yet existing methods offer no efficient, principled way to trim them, so researchers pay the full evaluation cost without learning anything extra from the redundant tasks.

What's the solution?

To address this issue, the authors developed BenTo, which uses in-context learning (ICL) to estimate how well performance on one task transfers to another. Treating these pairwise transferability scores as a measure of task similarity, BenTo selects the most representative subset of tasks by optimizing a facility location function. This lets benchmarks like MMLU or FLAN be reduced to just 5% of their tasks while changing the evaluation results by less than 4% compared to using the full benchmark.
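To make the selection step concrete, here is a minimal sketch (not the authors' code) of greedy facility-location subset selection. It assumes a precomputed matrix `transfer`, where `transfer[i, j]` is a non-negative score for how well task `j` represents task `i`; in BenTo such scores would come from the ICL-based transferability estimates, and the `budget` of 5% of tasks is just an illustrative parameter.

```python
import numpy as np

def select_representative_tasks(transfer: np.ndarray, budget: int) -> list[int]:
    """Greedily maximize a facility-location objective to pick a task subset.

    transfer[i, j]: assumed non-negative score for how well task j
    represents (covers) task i. Returns indices of the selected tasks.
    """
    n = transfer.shape[0]
    selected: list[int] = []
    coverage = np.zeros(n)  # best coverage of each task by the selected set

    for _ in range(budget):
        best_j, best_gain = -1, -1.0
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain in total coverage if task j were added
            gain = np.maximum(coverage, transfer[:, j]).sum() - coverage.sum()
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        coverage = np.maximum(coverage, transfer[:, best_j])

    return selected

# Toy usage: 20 tasks, keep 1 in 20 (5%)
rng = np.random.default_rng(0)
T = rng.random((20, 20))
print(select_representative_tasks(T, budget=1))
```

Because the facility location function is monotone submodular, this greedy procedure enjoys the standard (1 - 1/e) approximation guarantee, which is one reason it is a practical choice for picking a small representative task set.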

Why it matters?

This research is important because it makes the evaluation of AI models more efficient and cost-effective. By reducing the number of tasks required for testing without sacrificing quality, BenTo can help researchers and developers save time and resources while still ensuring that their models are accurately assessed. This advancement could lead to faster improvements in AI technology and better performance in real-world applications.

Abstract

Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference in the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient, requiring only ICL.