A Survey on Large Language Model Benchmarks
Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang
2025-08-22
Summary
This paper is a comprehensive overview of how we measure the performance of large language models, like the ones powering chatbots. It surveys the many different tests, called benchmarks, that people use to judge how capable these models are.
What's the problem?
Currently, the tests used to evaluate large language models aren't perfect. Some tests give artificially high scores because the models were inadvertently trained on the test data itself, a problem known as data contamination. Many tests are also biased toward certain cultures or languages, making comparisons between models unfair. Finally, existing benchmarks rarely assess how well models perform in dynamic, changing environments, or whether their intermediate reasoning steps can be trusted.
What's the solution?
The authors reviewed 283 different benchmarks and grouped them into three types: general-capability benchmarks (covering things like core language skills, knowledge, and reasoning), domain-specific benchmarks (focused on fields like the natural sciences, humanities, and engineering), and target-specific benchmarks (probing issues such as risks, reliability, and agent behavior). By categorizing these tests, they highlight the strengths and weaknesses of current evaluation methods and suggest a better way to design new, more reliable benchmarks.
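To make the three-way taxonomy concrete, here is a minimal sketch of how such a catalog of benchmarks could be represented in code. The benchmark names below are hypothetical placeholders, not entries from the survey's actual list of 283; only the category and subcategory labels come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Benchmark:
    """One benchmark and where it sits in the survey's taxonomy."""
    name: str
    category: str      # "general", "domain-specific", or "target-specific"
    subcategory: str   # e.g. "reasoning", "natural sciences", "risks"

# Hypothetical placeholder entries -- not the survey's actual benchmark list.
catalog = [
    Benchmark("ExampleLinguisticsBench", "general", "core linguistics"),
    Benchmark("ExampleScienceQA", "domain-specific", "natural sciences"),
    Benchmark("ExampleSafetyEval", "target-specific", "risks"),
]

# Group benchmark names by top-level category, mirroring the paper's split.
by_category: dict[str, list[str]] = {}
for b in catalog:
    by_category.setdefault(b.category, []).append(b.name)

print(by_category)
```

A flat record with category and subcategory fields keeps the grouping logic trivial; a survey-scale catalog would simply have more rows, not a different structure.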
Why it matters?
This work is important because it helps us understand how to accurately measure the abilities of large language models. Better benchmarks mean we can more effectively improve these models, making them more useful, safe, and fair for everyone. It guides future development by pointing out what aspects of model performance *really* need to be tested.
Abstract
In recent years, as the depth and breadth of large language models' capabilities have expanded rapidly, a corresponding wave of evaluation benchmarks has emerged. As quantitative assessment tools for model performance, benchmarks are not only a core means of measuring model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We provide the first systematic review of the current state and development of large language model benchmarks, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General-capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields such as the natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks address risks, reliability, agents, and related concerns. We point out that current benchmarks suffer from problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and a lack of evaluation of process credibility and performance in dynamic environments, and we provide a reference design paradigm for future benchmark innovation.