Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem
Valentin Quesnel, Damien Sileo
2025-09-09
Summary
This research focuses on improving the mathematical reasoning abilities of Large Language Models (LLMs) by creating a large dataset of provably correct mathematical statements and challenges.
What's the problem?
LLMs struggle with complex mathematical reasoning because they often lack access to enough high-quality, reliable training data. Existing datasets either rely on LLMs themselves to generate data, which can introduce errors, or use complicated formal proof languages that are hard to work with. Essentially, it's difficult to get LLMs to 'think' mathematically because they haven't been trained on enough solid mathematical facts and problems.
What's the solution?
The researchers built a system that automatically generates mathematical theorems using a powerful automated theorem prover called E-prover. They started with a huge library of mathematical axioms (basic truths, drawn from the TPTP library) and let E-prover derive new, logically valid theorems from those axioms through saturation. Because every theorem is produced by the prover itself, the data is guaranteed correct, unlike data created by LLMs. They then turned these theorems into three types of reasoning tasks: checking whether a statement follows from the axioms (entailment verification), finding the starting points a proof actually needs (premise selection), and reconstructing the proof itself (proof reconstruction), each designed to test a different aspect of mathematical reasoning.
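The saturation-and-harvest step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: it assumes the `eprover` binary is installed, and the axiom file name and the tautology filter are placeholder assumptions. The command-line flags shown (`--auto`, `--soft-cpu-limit`, `--print-saturated`) are real E-prover options.

```python
import re

def saturation_command(axiom_file: str, seconds: int = 10) -> list[str]:
    # Build an E-prover invocation that saturates the axiom set and prints
    # the resulting clause set, whose derived clauses are candidate theorems.
    return [
        "eprover",
        "--auto",                       # let E choose a proof strategy
        f"--soft-cpu-limit={seconds}",  # bound the saturation run
        "--print-saturated",            # emit the saturated clause set
        axiom_file,                     # hypothetical TPTP input file
    ]

def harvest_theorems(e_output: str) -> list[str]:
    # Collect cnf(...) clauses from E's output and drop trivial
    # reflexivity tautologies like X1 = X1 (an illustrative filter;
    # the paper's "interestingness" filtering is more involved).
    clauses = re.findall(r"cnf\([^.]*\)\.", e_output)
    return [c for c in clauses if "X1=X1" not in c.replace(" ", "")]
```

In a real run, the command's stdout would be passed to `harvest_theorems`; here the parsing is shown separately so the two stages stay independent.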
Why it matters?
This work is important because it identifies a key weakness in current LLMs: their inability to perform deep, structural reasoning in math. It also provides a way to create a large, reliable dataset to train LLMs to overcome this weakness. By offering both a way to measure how well LLMs reason mathematically and a source of training data to improve them, this research could significantly advance the field of AI and its ability to solve complex problems.
Abstract
The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover's saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for "interesting" theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available at https://github.com/sileod/reasoning_core and https://hf.co/datasets/reasoning-core/rc1.
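To make the three challenges concrete, here is a hedged sketch of how one derived theorem could be packaged into task instances. The record layout, field names, and the idea of mixing distractor axioms into the premise-selection candidates are illustrative assumptions, not the released dataset's schema.

```python
import random

def make_tasks(axioms, needed, theorem, distractors, seed=0):
    # axioms:      the full axiom set the theorem was derived from
    # needed:      the subset of axioms the proof actually uses
    # theorem:     a clause harvested from E-prover's saturation output
    # distractors: unrelated axioms mixed in to make selection non-trivial
    rng = random.Random(seed)  # fixed seed for reproducible shuffling
    pool = list(axioms) + list(distractors)
    rng.shuffle(pool)
    return {
        # Task 1: does the hypothesis follow from the premises?
        # (True by construction; negatives would need a separate source.)
        "entailment_verification": {
            "premises": list(axioms), "hypothesis": theorem, "label": True,
        },
        # Task 2: pick out the premises the proof actually requires.
        "premise_selection": {
            "candidates": pool, "needed": sorted(needed),
        },
        # Task 3: given premises and goal, reproduce the derivation.
        "proof_reconstruction": {
            "premises": list(axioms), "goal": theorem,
        },
    }
```

Difficulty could then be controlled by, e.g., the number of distractors or the depth of the derivation, though the paper's exact difficulty knobs are not shown here.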