Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin

2024-10-13

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Summary

This paper discusses how simple models, called 'null models,' can achieve high scores on automatic benchmarks for evaluating large language models (LLMs), raising concerns about the reliability of these benchmarks.

What's the problem?

Automatic benchmarks like AlpacaEval 2.0 and Arena-Hard-Auto are used to evaluate LLMs because they are cheaper and faster than human evaluations. However, these benchmarks can be manipulated, allowing models to score well without actually being effective. This is a problem because it makes it hard to tell which models genuinely perform better and could mislead users about their capabilities.

What's the solution?

The authors demonstrate that a null model, which always gives the same irrelevant response regardless of the input, can still achieve high win rates on these benchmarks—like an 86.5% win rate on AlpacaEval 2.0. They show that this happens because the benchmarks can be easily tricked, even with a model that doesn't understand the tasks it's supposed to perform. Their findings suggest that more robust anti-cheating mechanisms need to be developed to ensure that these benchmarks accurately measure the performance of LLMs.

Why it matters?

This research is important because it highlights vulnerabilities in the way LLMs are evaluated. If simple models can cheat the system, then it calls into question the effectiveness of current evaluation methods. By addressing these issues, researchers can work towards creating better benchmarks that truly reflect how well these advanced AI systems perform, ensuring that users get reliable information about their capabilities.

Abstract

Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.

View Paper