
The Leaderboard Illusion

Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker

2025-04-30


Summary

This paper looks at the Chatbot Arena leaderboard, which is supposed to show which AI chatbots are the best, and finds systematic flaws that make its rankings less fair and reliable than they seem.

What's the problem?

The problem is that some AI companies can test many different versions of their chatbots privately before releasing them, and then report only the best scores while retracting the bad ones. This biases the leaderboard in their favor, especially because their models are also sampled for more battles, and removed from the arena less often, than open-source models. As a result, the rankings end up reflecting who has more access to and control over the evaluation data, rather than which chatbot is genuinely the best.
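The unfairness here is a classic selection effect: if you privately measure many equally capable variants and publish only the highest score, the published number is inflated even though no variant is actually better. The toy simulation below is our own sketch, not code from the paper; the rating, noise level, and variant counts are all illustrative (the 27 mirrors the Llama-4 example mentioned in the abstract).

```python
import random
import statistics

# Toy simulation: how privately testing N variants and reporting only the
# best one inflates a model's apparent score, even when every variant has
# the same true skill and scores differ only by measurement noise.

TRUE_SCORE = 1200      # hypothetical true arena-style rating
NOISE_SD = 15          # per-variant noise from a limited number of battles
TRIALS = 10_000

def observed_score() -> float:
    """One noisy leaderboard measurement of a variant with fixed true skill."""
    return random.gauss(TRUE_SCORE, NOISE_SD)

def best_of(n: int) -> float:
    """Report only the best-scoring of n privately tested variants."""
    return max(observed_score() for _ in range(n))

for n in (1, 5, 10, 27):
    mean_reported = statistics.mean(best_of(n) for _ in range(TRIALS))
    print(f"variants tested privately: {n:>2}  "
          f"average reported score: {mean_reported:.1f}  "
          f"inflation: {mean_reported - TRUE_SCORE:+.1f}")
```

Running this shows the average reported score climbing well above the true rating as more variants are tested privately, which is exactly the dynamic the authors document.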

What's the solution?

The researchers looked closely at how the leaderboard works and found evidence of these unfair practices, like selective score reporting and uneven data access. They offer suggestions for how to fix the system, such as making the testing process more transparent and making sure all models get a fair shot at being evaluated.

Why it matters?

This matters because if the leaderboard is biased, people and companies might be misled about which AI chatbots are actually the best, slowing down progress and making it harder for new or open-source models to compete fairly. Fixing these issues would help the whole AI field advance in a more honest and trustworthy way.

Abstract

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
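Arena-style scores are computed from pairwise "battles", so how often a model is sampled directly determines how much comparison data backs its rating. The sketch below is our own illustration with made-up battle counts, not the Arena's actual code; it fits a simple Bradley-Terry model (the family of models arena ratings are based on) to show how head-to-head votes turn into a single score, and how a heavily sampled model like the hypothetical "big-proprietary" entry accumulates far more evaluation data than the others.

```python
from collections import defaultdict

# Minimal Bradley-Terry fit over pairwise "battles": each record is
# (model_a, model_b, wins_for_a, wins_for_b). All counts are hypothetical.
battles = [
    ("big-proprietary", "open-model-1", 60, 40),
    ("big-proprietary", "open-model-2", 70, 30),
    ("open-model-1",    "open-model-2", 12, 8),
]

models = {m for b in battles for m in b[:2]}
strength = {m: 1.0 for m in models}

# Minorization-maximization updates for the Bradley-Terry strengths,
# where P(i beats j) = strength_i / (strength_i + strength_j).
for _ in range(200):
    wins = defaultdict(float)
    denom = defaultdict(float)
    for a, b, wa, wb in battles:
        n = wa + wb
        wins[a] += wa
        wins[b] += wb
        denom[a] += n / (strength[a] + strength[b])
        denom[b] += n / (strength[a] + strength[b])
    strength = {m: wins[m] / denom[m] for m in models}
    total = sum(strength.values())
    strength = {m: v / total for m, v in strength.items()}  # normalize

for m, s in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"{m:>16}: {s:.3f}")
```

The point of the sketch is structural: the score each model receives is only as good as the battles it was sampled into, so unequal sampling rates translate directly into unequal amounts of evaluation data per provider, which is the asymmetry the abstract quantifies.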