SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
2025-07-02
Summary
This paper introduces SciArena, a community-driven platform where researchers evaluate and compare how well different foundation models perform on scientific literature tasks. Scientists vote on model-generated answers to real scientific questions, and those votes determine which models work best.
What's the problem?
The problem is that scientific literature is growing so fast that researchers struggle to keep up, and AI models are rarely tested rigorously on complex, detailed scientific tasks. Traditional benchmarks are static and quickly become outdated, making it difficult to know which models are truly good at understanding science.
What's the solution?
The researchers created SciArena, an open and interactive platform where users submit scientific questions and foundation models generate answers grounded in real scientific papers. Researchers then vote on which answer is better, and these votes are aggregated into model rankings (see the sketch below). By drawing on human expertise and collective intelligence, the platform evaluates models continuously, keeping the results fair, dynamic, and up to date.
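The summary does not spell out how pairwise votes become a leaderboard, but arena-style platforms commonly aggregate them with an Elo-style rating update. The following is a minimal, hypothetical sketch of that idea; the vote format, the constants, and the function names are assumptions for illustration, not details taken from the SciArena paper.

```python
from collections import defaultdict

K = 32          # update step size (assumed)
BASE = 1000.0   # starting rating for every model (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def rank_models(votes):
    """votes: iterable of (model_a, model_b, winner) tuples,
    where winner is 'a', 'b', or 'tie'."""
    ratings = defaultdict(lambda: BASE)
    for model_a, model_b, winner in votes:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Winner's rating rises, loser's falls, by the same amount.
        ratings[model_a] += K * (score_a - e_a)
        ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Example usage with made-up votes:
leaderboard = rank_models([
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
])
print(leaderboard)
```

Because ratings are updated vote by vote, a scheme like this keeps the leaderboard current as new questions and votes arrive, which matches the platform's goal of continuous evaluation.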
Why it matters?
This matters because it helps ensure that AI models used in science are accurate and reliable. By involving the scientific community directly, SciArena provides a better way to measure progress in AI for scientific tasks, supporting advances in how science is understood and discovered with AI.
Abstract
SciArena is a community-driven platform for evaluating foundation models on scientific literature tasks using collective intelligence and human voting.