
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge

2024-12-13


Summary

This paper introduces ONEBench, a new benchmarking system that evaluates the open-ended capabilities of AI models by letting users assemble custom tests from a large, shared pool of samples.

What's the problem?

Traditional benchmarks for testing AI models are often fixed and limited, making it hard to assess how well these models can handle a variety of tasks in real-world situations. These fixed tests do not adapt to new challenges and may not cover all the capabilities that users are interested in.

What's the solution?

ONEBench addresses this by consolidating many evaluation datasets into a single, flexible sample pool. Users can assemble their own custom benchmarks targeting the specific capabilities they want to test. Because samples are aggregated across many sources, assessments become more diverse and less prone to the biases of any single dataset. The system also includes aggregation algorithms that combine results from different tests, even when models were evaluated on different subsets of samples, to produce reliable rankings of model performance.
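To make the idea concrete, here is a minimal sketch of how a custom benchmark might be assembled from a shared sample pool. The Sample schema, the capability tags, and the per-model result format below are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    sample_id: str
    capabilities: set                            # e.g. {"charts", "reasoning"}
    results: dict = field(default_factory=dict)  # model name -> score (0/1 or float)

def build_custom_benchmark(pool, wanted_capabilities):
    """Select every pooled sample that tests at least one wanted capability."""
    wanted = set(wanted_capabilities)
    return [s for s in pool if s.capabilities & wanted]

def mean_scores(benchmark):
    """Average each model's scores over only the samples it was actually run on."""
    totals, counts = {}, {}
    for sample in benchmark:
        for model, score in sample.results.items():
            totals[model] = totals.get(model, 0.0) + score
            counts[model] = counts.get(model, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

# Example: rank two hypothetical models on a user-defined "charts" capability.
pool = [
    Sample("s1", {"charts", "reasoning"}, {"model_a": 1.0, "model_b": 0.0}),
    Sample("s2", {"ocr"},                 {"model_a": 0.0, "model_b": 1.0}),
    Sample("s3", {"charts"},              {"model_b": 1.0}),  # model_a never ran here
]
custom = build_custom_benchmark(pool, {"charts"})
print(mean_scores(custom))  # {'model_a': 1.0, 'model_b': 0.5}
```

Averaging each model's scores over only the samples it has seen is the naive way to handle incompleteness; the paper's aggregation algorithms are designed to remain reliable under this kind of sparsity.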

Why it matters?

This research is significant because it democratizes the evaluation process for AI models, enabling researchers and developers to tailor assessments to their needs. By allowing for continuous updates and the integration of new tests, ONEBench can keep up with the rapid advancements in AI technology, ensuring that evaluations remain relevant and comprehensive.

Abstract

Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. To address this, we propose ONEBench (OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests. The shift from task-specific benchmarks to ONEBench introduces two challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity refers to the aggregation over diverse metrics, while incompleteness describes comparing models evaluated on different data subsets. To address these challenges, we explore algorithms to aggregate sparse measurements into reliable model scores. Our aggregation algorithm ensures identifiability (asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data. On homogeneous datasets, we show our aggregation algorithm provides rankings that highly correlate with those produced by average scores. We also demonstrate robustness to ~95% of measurements missing, reducing evaluation cost by up to 20x with little-to-no change in model rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains. Overall, we present a technique for open-ended evaluation, which can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside the rapidly developing foundation models.
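The abstract's core technical step is turning sparse, sample-level outcomes into a single reliable ranking even though different models were run on different subsets of samples. The sketch below uses a Bradley-Terry style fit over pairwise win counts as an illustrative stand-in for the paper's aggregation algorithm; the pairwise-outcome format, model names, and iteration scheme are assumptions made for the example.

```python
from collections import defaultdict

def bradley_terry(pairwise_wins, n_iters=200):
    """Fit a strength score per model from sparse pairwise win counts.

    pairwise_wins[(a, b)] = number of samples on which model a beat model b.
    Uses the standard minorization-maximization update for the Bradley-Terry model.
    """
    models = sorted({m for pair in pairwise_wins for m in pair})
    strength = {m: 1.0 for m in models}

    # Total head-to-head comparison counts, symmetric in each pair.
    totals = defaultdict(float)
    for (a, b), w in pairwise_wins.items():
        totals[frozenset((a, b))] += w

    for _ in range(n_iters):
        new = {}
        for m in models:
            wins = sum(w for (a, _b), w in pairwise_wins.items() if a == m)
            denom = 0.0
            for pair, n in totals.items():
                if m in pair:
                    other = next(o for o in pair if o != m)
                    denom += n / (strength[m] + strength[other])
            new[m] = wins / denom if denom > 0 else strength[m]
        norm = sum(new.values())
        strength = {m: v / norm for m, v in new.items()}

    return sorted(strength.items(), key=lambda kv: kv[1], reverse=True)

# Example: three hypothetical models compared on overlapping but incomplete
# sample subsets; models a and c share far fewer samples than the other pairs.
wins = {
    ("model_a", "model_b"): 70, ("model_b", "model_a"): 30,
    ("model_b", "model_c"): 60, ("model_c", "model_b"): 40,
    ("model_a", "model_c"): 15, ("model_c", "model_a"): 5,
}
print(bradley_terry(wins))  # roughly: model_a > model_b > model_c
```

On this toy input the fit recovers the intuitive ordering despite the uneven coverage; the paper's claim is that such aggregation stays stable even with roughly 95% of the sample-level measurements missing.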