A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang

2025-09-01

Summary

This paper is a broad survey of how large language models are being used in science, and of how the unique characteristics of scientific data shape their development and effectiveness.

What's the problem?

Developing AI for science is harder than developing AI for general language tasks because scientific information is complex. It comes in many forms – text, images, and data tables – and exists at different levels of detail, from molecules to whole organisms or systems. Scientific data is also often uncertain and highly field-specific, so a model trained on biology might not understand chemistry. Existing language models aren't designed to handle these challenges, and there is a lack of well-organized, high-quality scientific datasets for training them.

What's the solution?

The researchers analyzed a large body of recent scientific language models and the datasets used to build them – over 270 training datasets and 190 benchmarks! They developed a taxonomy for categorizing different types of scientific data and how scientific knowledge is structured. They also examined how scientists are *testing* these models, noting a shift toward more complex evaluations that assess a model's ability to actually *do* science, not just answer questions. Finally, they explored ways to semi-automatically improve scientific datasets with help from domain experts.

Why it matters?

This work matters because it provides a roadmap for building AI systems that can genuinely assist scientists. By understanding the specific challenges of scientific data and how to address them, we can create AI that doesn't just process information but actively participates in the scientific process – designing experiments, validating results, and expanding the knowledge base. This could significantly accelerate scientific discovery.

Abstract

Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.