Sci-Reasoning: A Dataset Decoding AI Innovation Patterns
Jiachen Liu, Maestro Harmon, Zechen Zhang
2026-01-13
Summary
This paper investigates how researchers actually come up with new ideas in Artificial Intelligence, a process that remains poorly understood despite the field's rapid progress.
What's the problem?
Currently, we don't have a good way to track *how* AI researchers build on previous work and identify problems to solve. It's hard to analyze the thought process behind breakthroughs because this information isn't systematically recorded, making it difficult to improve how AI research is done or even to build AI systems that can *do* research themselves.
What's the solution?
The researchers created a dataset called Sci-Reasoning. They looked at highly-regarded AI papers from major conferences (NeurIPS, ICML, and ICLR) and traced back which earlier papers influenced them. They didn't just note the connections, but specifically described *why* those earlier papers were important – what kind of reasoning linked the old work to the new. They used both AI tools and human reviewers to ensure accuracy and identified 15 common patterns of thinking, with a few being particularly frequent, like finding gaps in existing research, combining ideas from different fields, and changing how information is represented.
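The structured reasoning links described above can be pictured as records pairing a paper with its influential predecessors and the thinking pattern connecting them. The sketch below is purely illustrative: the class and field names are assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a Sci-Reasoning-style record.
# All names here are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class ReasoningLink:
    predecessor_title: str  # earlier paper that influenced the new work
    pattern: str            # one of the 15 identified thinking patterns
    rationale: str          # why the earlier paper mattered to the new one

@dataclass
class PaperRecord:
    title: str
    venue: str              # NeurIPS, ICML, or ICLR
    year: int
    links: list = field(default_factory=list)

record = PaperRecord(title="(example new paper)", venue="NeurIPS", year=2024)
record.links.append(ReasoningLink(
    predecessor_title="(example earlier paper)",
    pattern="Gap-Driven Reframing",
    rationale="The earlier work left a limitation that the new paper reframes as its core problem.",
))
```

A record like this makes the dataset queryable: one can count how often each pattern appears, or which patterns co-occur within a single paper's links.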
Why does it matter?
This work is important because it provides a way to actually study the process of scientific discovery in AI. By understanding how successful AI research happens, we can potentially improve the research process itself and even build AI systems capable of independent research, accelerating future innovation.
Abstract
While AI innovation accelerates rapidly, the intellectual process behind breakthroughs -- how researchers identify gaps, synthesize prior work, and generate insights -- remains poorly understood. The lack of structured data on scientific reasoning hinders systematic analysis and the development of AI research agents. We introduce Sci-Reasoning, the first dataset capturing the intellectual synthesis behind high-quality AI research. Using community-validated quality signals and an LLM-accelerated, human-verified pipeline, we trace Oral and Spotlight papers across NeurIPS, ICML, and ICLR (2023-2025) to their key predecessors, articulating the specific reasoning links in a structured format. Our analysis identifies 15 distinct thinking patterns, with three dominant strategies accounting for 52.7% of cases: Gap-Driven Reframing (24.2%), Cross-Domain Synthesis (18.0%), and Representation Shift (10.5%). The most powerful innovation recipes combine multiple patterns: Gap-Driven Reframing + Representation Shift, Cross-Domain Synthesis + Representation Shift, and Gap-Driven Reframing + Cross-Domain Synthesis. This dataset enables quantitative studies of scientific progress and provides structured reasoning trajectories for training the next generation of AI research agents.