
DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh

2025-10-13


Summary

This paper addresses the issue of how expensive it is to thoroughly test and evaluate large, modern machine learning models. It introduces a new method for quickly and accurately predicting how well a model will perform without needing to run it on massive datasets.

What's the problem?

Evaluating complex models such as large language models requires a huge amount of computing power, often thousands of hours on powerful GPUs. This high cost creates several problems: it limits who can participate in developing and improving these models, it slows the overall pace of progress, and it harms the environment through energy consumption. Current methods try to pick a small "anchor" set of data to test on, then map accuracy on that subset to predicted performance on the full benchmark. But choosing the anchor set is tricky and typically relies on clustering techniques that are complex and sensitive to design choices.
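To make the two-step anchor approach concrete, here is a minimal sketch. All names and data are hypothetical: it uses random per-sample correctness scores for a pool of previously evaluated models, picks a random anchor subset (real methods use clustering), and fits a linear mapping from anchor accuracy to full-benchmark accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: rows = previously evaluated models,
# columns = per-sample correctness (1 = correct) on the full benchmark.
rng = np.random.default_rng(0)
full_scores = rng.integers(0, 2, size=(20, 1000)).astype(float)

# Step 1: select an anchor subset (random here; prior work uses clustering).
anchor_idx = rng.choice(1000, size=50, replace=False)

# Step 2: fit a mapping from anchor accuracy to full-benchmark accuracy.
X = full_scores[:, anchor_idx].mean(axis=1, keepdims=True)  # anchor accuracy
y = full_scores.mean(axis=1)                                # full accuracy
mapping = LinearRegression().fit(X, y)

# A new model is then evaluated on only the 50 anchor samples,
# and its full-benchmark accuracy is predicted from that.
new_scores = rng.integers(0, 2, size=1000).astype(float)
pred = mapping.predict([[new_scores[anchor_idx].mean()]])[0]
```

The expensive part this avoids is running the new model on all 1000 samples; only the anchor subset is evaluated.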

What's the solution?

The researchers propose a method called Diversifying Sample Condensation, or DISCO. Instead of trying to find a diverse *set* of data points, DISCO selects the data points on which different models *disagree* the most in their answers. The idea is that these points are the most informative for predicting overall performance. The approach is simpler than previous methods because it needs no clustering: it ranks each sample individually by a disagreement statistic and greedily keeps the top-k. The authors also show theoretically that inter-model disagreement is an information-theoretically optimal criterion for this kind of greedy selection.

Why it matters?

DISCO offers a faster and more accurate way to evaluate machine learning models, potentially making the field more accessible and accelerating innovation. By reducing the need for extensive and costly testing, it can lower the barrier to entry for researchers and developers and lessen the environmental impact of AI development. The method outperforms existing techniques at performance prediction on several common benchmarks, including MMLU, Hellaswag, Winogrande, and ARC.

Abstract

Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. DISCO shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.