Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun
2025-06-17
Summary
This paper introduces Scientists' First Exam (SFE), a new benchmark designed to test how well Multimodal Large Language Models (MLLMs) can think like scientists when working with scientific data in different forms, such as images and text. It measures three abilities: perceiving scientific signals, understanding what those signals mean, and reasoning by comparing information across multiple data sources in real scientific disciplines such as astronomy, chemistry, and the earth sciences.
What's the problem?
Most current benchmarks test whether AI models know scientific facts, but they do not check whether the models can actually perceive fine-grained scientific data, interpret it, or reason about complex scientific questions that involve multiple pieces of evidence. This gap makes it hard to know whether these AI systems can genuinely assist in real scientific discovery.
What's the solution?
The solution is the SFE benchmark, which is organized around three levels of scientific cognition: perceiving signals in raw scientific data such as images or spectra, understanding attributes by interpreting what those signals mean in scientific terms, and performing comparative reasoning by analyzing multiple pieces of scientific data and drawing conclusions from them. The benchmark consists of 830 expert-verified questions covering 66 multimodal scientific tasks, presented in native scientific data formats with bilingual questions to thoroughly evaluate models' cognitive abilities.
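As a rough illustration of how a benchmark structured this way might be consumed (this is a sketch, not the authors' actual release format; the field names, the `SFEItem` record, and the exact-match scoring rule are all assumptions), one could imagine each item carrying its question, associated scientific data, reference answer, discipline, and cognition level, with accuracy reported per level:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical record layout for one SFE-style item; the real release
# format may differ (field names here are illustrative assumptions).
@dataclass
class SFEItem:
    question: str            # bilingual prompt text
    image_paths: List[str]   # raw scientific data (e.g., images, spectra)
    answer: str              # expert-verified reference answer
    discipline: str          # e.g., "astronomy", "chemistry", "earth science"
    cognition_level: str     # "perception" | "understanding" | "reasoning"

def accuracy_by_level(items: List[SFEItem],
                      model: Callable[[str, List[str]], str]) -> Dict[str, float]:
    """Score a model callable per cognition level with exact-match accuracy.

    `model` takes (question, image_paths) and returns an answer string;
    the actual SFE evaluation may use more forgiving answer matching.
    """
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        pred = model(item.question, item.image_paths)
        level = item.cognition_level
        total[level] = total.get(level, 0) + 1
        if pred.strip().lower() == item.answer.strip().lower():
            correct[level] = correct.get(level, 0) + 1
    return {level: correct.get(level, 0) / n for level, n in total.items()}
```

Reporting accuracy per cognition level, rather than a single aggregate score, is what lets a benchmark like this separate a model's raw perception from its deeper understanding and reasoning.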
Why it matters?
This matters because it provides a more realistic and rigorous way to test and improve AI models' scientific thinking, helping them move beyond recalling facts toward genuine scientific analysis and reasoning. Better AI in this area could accelerate scientific discovery by assisting researchers with complex data interpretation and decision-making across scientific fields.
Abstract
The Scientists' First Exam (SFE) benchmark assesses the scientific cognitive capacities of Multimodal Large Language Models across three levels: perception, understanding, and comparative reasoning.