
Automating Benchmark Design

Amanda Dsouza, Harit Vishwakarma, Zhengyang Qi, Justin Bauer, Derek Pham, Thomas Walshe, Armin Parchami, Frederic Sala, Paroma Varma

2025-10-30


Summary

This paper introduces BeTaL, a framework that automatically designs benchmarks for testing large language models (LLMs). Evaluating these models is hard because existing tests quickly become too easy as the models improve.

What's the problem?

Evaluating how good LLMs are is a challenge. Traditional tests are created by hand and quickly become outdated because LLMs rapidly get better at them. Creating new, constantly updating tests is expensive and time-consuming, leaving a gap in our ability to accurately measure LLM progress and capabilities.

What's the solution?

BeTaL tackles this by using an LLM itself to *design* new tests. It starts with basic test templates and lets an LLM adjust different parts of those templates – things like task complexity or how realistic the scenarios are – until the resulting tests reach a target difficulty. It's like having an LLM build tests tailored to push other LLMs to their limits, and it does this cost-efficiently.
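The loop described above can be sketched in miniature. This is only an illustration, not BeTaL's actual code: the parameter names (`steps`, `distractors`), the `measure_difficulty` scoring stub, and the `propose` heuristic are all hypothetical stand-ins for, respectively, benchmark design parameters, running a model on generated test instances, and the LLM reasoning over past results to suggest new parameters.

```python
def measure_difficulty(params):
    """Stand-in for generating benchmark instances from `params`,
    running a model on them, and measuring its failure rate.
    Here we fake it with a simple monotone function."""
    return min(1.0, 0.1 * params["steps"] + 0.05 * params["distractors"])

def propose(history, target):
    """Stand-in for the LLM-in-the-loop step: reason over past
    (params, difficulty) pairs and suggest new parameters. Here we
    just nudge the step count up if the last benchmark was too easy,
    down if it was too hard."""
    params, difficulty = history[-1]
    delta = 1 if difficulty < target else -1
    return {**params, "steps": max(1, params["steps"] + delta)}

def tune_benchmark(target, init, rounds=10):
    """Iteratively tune benchmark parameters toward a target difficulty."""
    history = [(init, measure_difficulty(init))]
    for _ in range(rounds):
        candidate = propose(history, target)
        history.append((candidate, measure_difficulty(candidate)))
    # Keep the parameter setting whose measured difficulty is closest
    # to the target.
    return min(history, key=lambda h: abs(h[1] - target))

best_params, best_difficulty = tune_benchmark(
    target=0.6, init={"steps": 1, "distractors": 2}
)
```

With these toy stand-ins, the loop walks the step count up until the measured difficulty matches the 0.6 target. The real system replaces the two stubs with actual model evaluations and LLM reasoning over a richer parameter space.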

Why it matters?

This work is important because it provides a way to keep up with the fast pace of LLM development. By automating the creation of challenging tests, we can get a more accurate understanding of what LLMs can and can't do, which is crucial for building reliable and trustworthy AI systems. The benchmarks created by BeTaL land much closer to the desired difficulty than those from existing methods: average deviations of 5.3% to 13.2%, a 2-4x improvement over the baselines.

Abstract

The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend a popular agentic benchmark, tau-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2% -- a 2-4x improvement over the baselines.