MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
Run-Ze Fan, Zengzhi Wang, Pengfei Liu
2025-07-23
Summary
This paper introduces MegaScience, a set of large, high-quality datasets built from university textbooks and other trusted sources to help AI models get better at scientific reasoning.
What's the problem?
Existing open-source datasets focus mainly on math and coding, and they lack the scientific knowledge and reasoning data that AI needs to understand real science well.
What's the solution?
The authors developed two main datasets: TextbookReasoning, with questions drawn from 12,000 scientific textbooks, and MegaScience, a combined dataset of over 1.25 million examples. They also built a detailed evaluation system to measure how well AI models learn scientific subjects. Models trained on these datasets outperformed existing ones on science reasoning tasks.
Why does it matter?
This matters because better scientific reasoning can help AI systems that act as scientists support real researchers, speed up discoveries, and assist education, pushing forward what AI can do in science and technology.
Abstract
The TextbookReasoning and MegaScience datasets, along with a comprehensive evaluation system, enhance scientific reasoning in AI by providing high-quality, verifiable data, and models trained on them outperform existing models.