From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

2024-06-19

Summary

This paper presents a new system called BenchBuilder, which automatically creates high-quality benchmarks for evaluating language models using data from live crowdsourced sources like the Chatbot Arena. It aims to improve how we assess the performance of these models by selecting the best prompts from a large pool of user-generated content.

What's the problem?

As language models have advanced rapidly, there is a growing need for better benchmarks that can accurately measure their capabilities. Current benchmarks often struggle to differentiate between models and do not reflect real-world user preferences. Additionally, while platforms like the Chatbot Arena collect a wide variety of prompts and user feedback, the quality of these prompts varies significantly, making it difficult to build reliable benchmarks from them directly.

What's the solution?

To tackle this issue, the authors developed BenchBuilder, which identifies high-quality prompts using seven key indicators, such as specificity and the need for domain knowledge. The system employs an LLM (large language model) annotator to evaluate and select the best prompts from various topic clusters, ensuring that only the most challenging and relevant prompts are used for benchmarking. The result is a new benchmark called Arena-Hard-Auto v0.1, which consists of 500 carefully curated user prompts. This benchmark has been shown to be significantly more effective than previous ones, achieving 89.1% agreement with human preference rankings and offering confidence intervals roughly three times tighter than MT-Bench, at a cost of about $25 and without human labelers.
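
To make the pipeline concrete, here is a minimal sketch (not the authors' code) of how an LLM annotator could grade each prompt against a checklist of quality indicators and keep only the highest-scoring prompts per topic cluster. Beyond specificity and domain knowledge, the indicator names, the annotator prompt, the score threshold, and the model name are assumptions for illustration; the OpenAI Python client is used as one possible annotator backend.

```python
# Minimal sketch of a BenchBuilder-style prompt filter (illustrative, not the paper's code).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical quality indicators; the paper names specificity and domain
# knowledge, the rest are assumed here for illustration.
INDICATORS = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world_application",
]

ANNOTATOR_PROMPT = (
    "Score the following user prompt on each criterion with 1 (present) or 0 "
    "(absent). Respond with a JSON object whose keys are the criteria.\n"
    f"Criteria: {', '.join(INDICATORS)}\n\nPrompt:\n{{prompt}}"
)

def score_prompt(prompt: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM annotator to grade one prompt; return how many indicators it satisfies."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": ANNOTATOR_PROMPT.format(prompt=prompt)}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(response.choices[0].message.content)
    return sum(int(scores.get(k, 0)) for k in INDICATORS)

def filter_cluster(prompts: list[str], min_score: int = 6) -> list[str]:
    """Keep only prompts from a topic cluster that satisfy most indicators."""
    return [p for p in prompts if score_prompt(p) >= min_score]
```

Running `filter_cluster` over each topic cluster and pooling the survivors would yield a challenging, diverse prompt set in the spirit of Arena-Hard-Auto, though the actual selection and deduplication steps in the paper are more involved.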

Why it matters?

This research is important because it addresses the need for more effective evaluation methods for language models. By using BenchBuilder to create high-quality benchmarks from live data, developers can better assess how well their models perform in real-world scenarios. This advancement could lead to improved AI systems that are more aligned with user needs and expectations, ultimately enhancing applications in various fields such as customer service, education, and healthcare.

Abstract

The rapid evolution of language models has necessitated the development of more challenging benchmarks. Current static benchmarks often struggle to consistently distinguish between the capabilities of different models and fail to align with real-world user preferences. On the other hand, live crowd-sourced platforms like the Chatbot Arena collect a wide range of natural prompts and user feedback. However, these prompts vary in sophistication and the feedback cannot be applied offline to new models. In order to ensure that benchmarks keep up with the pace of LLM development, we address how one can evaluate benchmarks on their ability to confidently separate models and their alignment with human preference. Under these principles, we developed BenchBuilder, a living benchmark that filters high-quality prompts from live data sources to enable offline evaluation on fresh, challenging prompts. BenchBuilder identifies seven indicators of a high-quality prompt, such as the requirement for domain knowledge, and utilizes an LLM annotator to select a high-quality subset of prompts from various topic clusters. The LLM evaluation process employs an LLM judge to ensure a fully automated, high-quality, and constantly updating benchmark. We apply BenchBuilder on prompts from the Chatbot Arena to create Arena-Hard-Auto v0.1: 500 challenging user prompts from a wide range of tasks. Arena-Hard-Auto v0.1 offers 3x tighter confidence intervals than MT-Bench and achieves a state-of-the-art 89.1% agreement with human preference rankings, all at a cost of only $25 and without human labelers. The BenchBuilder pipeline enhances evaluation benchmarks and provides a valuable tool for developers, enabling them to extract high-quality benchmarks from extensive data with minimal effort.
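
As an illustration of the evaluation side, the sketch below shows one way to turn per-prompt LLM-judge verdicts into a win rate with a bootstrapped 95% confidence interval, and to measure how a benchmark's model ranking agrees with a human-preference ranking. This is not the paper's exact methodology; the function names, the toy data, and the use of Spearman correlation (via SciPy) as an agreement proxy are assumptions.

```python
# Illustrative sketch: bootstrapped win-rate confidence interval and a
# ranking-agreement proxy for an automated, LLM-judged benchmark.
import random
from scipy.stats import spearmanr

def bootstrap_win_rate(verdicts: list[int], n_boot: int = 1000, seed: int = 0):
    """verdicts: 1 if the candidate model beat the baseline on a prompt, else 0.
    Returns the mean win rate and a 95% bootstrap confidence interval."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(verdicts) for _ in verdicts]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo, hi = means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]
    return sum(verdicts) / len(verdicts), (lo, hi)

def ranking_agreement(benchmark_ranks: list[int], human_ranks: list[int]) -> float:
    """Rank correlation between the benchmark's model ordering and a
    human-preference ordering (e.g. Chatbot Arena); the paper reports
    agreement differently, so this is only an illustrative proxy."""
    corr, _ = spearmanr(benchmark_ranks, human_ranks)
    return corr

# Toy usage: 500 hypothetical judge verdicts for one model against a baseline.
rng = random.Random(1)
verdicts = [rng.randint(0, 1) for _ in range(500)]
win_rate, ci = bootstrap_win_rate(verdicts)
print(f"win rate {win_rate:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```

Narrower bootstrap intervals and higher ranking agreement are the two properties the abstract highlights for Arena-Hard-Auto v0.1 relative to MT-Bench.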