
LiveBench: A Challenging, Contamination-Free LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum

2024-06-28

Summary

This paper introduces LiveBench, a new benchmark designed to evaluate large language models (LLMs) without the problems of test set contamination. It aims to provide a fair and challenging way to assess how well these models perform on a variety of tasks.

What's the problem?

Many existing benchmarks for LLMs suffer from test set contamination. This happens when a benchmark's test questions end up in a model's training data, so the model has effectively already seen the answers and its performance looks better than it really is. Additionally, using human judges or other LLMs to score answers can introduce biases and inaccuracies, especially on difficult questions.

What's the solution?

To solve these problems, the authors created LiveBench, which has three key features: frequently updated questions drawn from recent information sources (such as recent math competitions, arXiv papers, news articles, and datasets), answers scored automatically against objective ground-truth values instead of relying on human or LLM judges, and a wide variety of challenging tasks spanning math, coding, reasoning, language, instruction following, and data analysis. The questions are designed to be tough, with no current model scoring higher than 65% accuracy. LiveBench will continuously add new tasks and harder versions of existing ones to keep pace with improvements in LLM capabilities.
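To make the judge-free scoring idea concrete, here is a minimal Python sketch. It is not LiveBench's actual implementation (the paper uses task-specific automated scoring routines); it simply illustrates grading answers by normalized exact match against an objective ground-truth value and averaging the scores into an accuracy.

```python
# Minimal sketch (not LiveBench's actual code): automatic grading against
# objective ground-truth values, with light normalization so that trivial
# formatting differences ("  Paris " vs "paris") do not affect the score.

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(text.lower().split())

def score_answer(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 for a normalized exact match with the ground truth, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

def benchmark_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Average the per-question scores into an overall accuracy."""
    scores = [score_answer(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    preds = ["  Paris ", "7", "blue"]
    truths = ["paris", "8", "blue"]
    print(f"Accuracy: {benchmark_accuracy(preds, truths):.2f}")  # Accuracy: 0.67
```

Because the ground truth is fixed and checked mechanically, there is no judge whose biases or errors can distort the score, which is the core design choice the summary above describes.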

Why it matters?

This research is important because it provides a more reliable way to evaluate LLMs by avoiding the judging biases and contamination issues found in previous benchmarks. By offering a fair assessment framework, LiveBench can help researchers and developers better understand the strengths and weaknesses of different LLMs, ultimately leading to improvements in AI technology.

Abstract

Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.