StatEval: A Comprehensive Benchmark for Large Language Models in Statistics
Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou
2025-10-13
Summary
This paper introduces StatEval, a new and extensive benchmark designed to test how well large language models (LLMs) can handle statistics problems, ranging from basic college coursework to advanced research-level questions.
What's the problem?
While LLMs are getting better at math and logic, their ability to do statistics hasn't been thoroughly tested. Existing benchmarks don't target the specific skills needed for statistical reasoning, leaving a gap in our understanding of how these models perform in this important field. It's hard to know whether LLMs truly 'understand' statistics or are just recognizing patterns.
What's the solution?
The researchers created StatEval, a collection of over 16,000 statistics problems. They didn't just write these problems themselves; they built a system that automatically finds, rewrites, and checks the quality of problems from textbooks and research papers, with humans verifying the accuracy. They also developed a way to evaluate the models' answers, looking at both calculations and the logical steps used to solve the problems. They then tested several LLMs, including some powerful closed-source models and publicly available ones, using StatEval.
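The paper describes this multi-agent pipeline only at a high level, so the sketch below is a toy illustration rather than the authors' implementation: every function name, the rewrite heuristic, and the quality-check rule are invented here, and in the real system each stage would be an LLM agent with human verification at the end.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    source: str
    raw_text: str
    rewritten: str = ""
    passed_qc: bool = False
    human_verified: bool = False


def extract_agent(document: str) -> list[Problem]:
    # Stage 1: split a source document into candidate problems.
    # (The real pipeline would use an LLM to locate problems in
    # textbooks and papers; here we naively split on blank lines.)
    return [Problem(source="textbook", raw_text=chunk.strip())
            for chunk in document.split("\n\n") if chunk.strip()]


def rewrite_agent(p: Problem) -> Problem:
    # Stage 2: rewrite into a self-contained question.
    # (Stand-in heuristic; an LLM rewriter would do this in practice.)
    p.rewritten = p.raw_text.rstrip(".") + "?"
    return p


def qc_agent(p: Problem) -> Problem:
    # Stage 3: automated quality control with a placeholder rule.
    p.passed_qc = len(p.rewritten) > 10 and p.rewritten.endswith("?")
    return p


def human_review(p: Problem, approve: bool = True) -> Problem:
    # Stage 4: human-in-the-loop verification of QC survivors.
    p.human_verified = p.passed_qc and approve
    return p


def run_pipeline(document: str) -> list[Problem]:
    # Chain the stages and keep only problems that pass automated QC.
    problems = [qc_agent(rewrite_agent(p)) for p in extract_agent(document)]
    return [human_review(p) for p in problems if p.passed_qc]
```

The point of the sketch is the shape of the workflow, extract, rewrite, automated QC, then human sign-off, which is what lets the benchmark scale to 16,000+ problems without sacrificing accuracy.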
Why it matters?
This work is important because it shows that current LLMs still struggle with statistics, even those that perform well on other tasks. StatEval provides a standard way to measure progress in this area and will help researchers develop better LLMs that can truly reason statistically, which is crucial for fields like data science, medicine, and economics.
Abstract
Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that even closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.
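The abstract's two-track evaluation, one path for computational answers and one for proof-based tasks, could be sketched as below. This is an assumed minimal design, not StatEval's actual scoring code: the function names and tolerance are illustrative, and in practice the per-step proof judgments would come from an LLM judge rather than a boolean list.

```python
import math


def score_computational(pred: str, gold: float, rel_tol: float = 1e-4) -> bool:
    # A numeric answer is correct if it parses as a number and
    # matches the reference value within a relative tolerance.
    try:
        return math.isclose(float(pred), gold, rel_tol=rel_tol)
    except ValueError:
        # Non-numeric output (e.g. a refusal) counts as incorrect.
        return False


def score_proof(step_judgments: list[bool]) -> float:
    # Fine-grained proof score: the fraction of reasoning steps
    # judged valid (judgments supplied by a grader, e.g. an LLM judge).
    if not step_judgments:
        return 0.0
    return sum(step_judgments) / len(step_judgments)
```

Separating exact-value checking from step-level grading is what allows "fine-grained assessment of reasoning ability": a model can be credited for partially correct proofs even when the final conclusion is wrong.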