Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision
Wei Du, Shubham Toshniwal, Branislav Kisacanin, Sadegh Mahdavi, Ivan Moshkov, George Armstrong, Stephen Ge, Edgar Minasyan, Feng Chen, Igor Gitman
2025-12-19
Summary
This paper introduces a new, very large dataset called Nemotron-Math designed to help AI systems get better at solving math problems. It focuses on providing detailed step-by-step solutions and on letting the AI use tools, such as running Python code, while it reasons.
What's the problem?
Current datasets used to teach AI to solve math problems aren't good enough. They often lack detailed explanations of *how* to solve the problems, don't show a variety of problem-solving approaches, and don't effectively let the AI use helpful tools. This limits how well AI can truly 'reason' through math, instead of just memorizing patterns.
What's the solution?
The researchers created Nemotron-Math, a dataset with 7.5 million step-by-step solution traces to math problems. These solutions aren't just final answers; they show the full reasoning process, at three levels of detail (high, medium, and low). The problems come from math competitions (AoPS) and from community questions on StackExchange, giving a wide range of difficulty and problem types. Importantly, each solution is available both with and without Python code used as a tool. They also developed a training strategy that groups solutions by length, speeding up fine-tuning on very long solutions by 2-3x.
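The paper doesn't spell out its sequential bucketed strategy in this summary, but the core idea of length bucketing is standard: group samples by token count so each batch runs at the smallest context length that fits it, and process buckets shortest-first so only a few steps pay the full 128K cost. A minimal sketch, where the boundary values and function names are illustrative assumptions:

```python
from collections import defaultdict


def bucket_by_length(lengths, boundaries):
    """Assign each sample index to the smallest bucket that fits its token length.

    lengths    -- token count of each training sample
    boundaries -- ascending max context lengths, e.g. [4096, 16384, 131072]
    """
    buckets = defaultdict(list)
    for idx, n in enumerate(lengths):
        for bound in boundaries:
            if n <= bound:
                buckets[bound].append(idx)
                break
        else:
            raise ValueError(f"sample {idx} ({n} tokens) exceeds largest boundary")
    return dict(buckets)


def sequential_batches(buckets, batch_size):
    """Yield (context_length, batch_indices), shortest bucket first, so most
    training steps run at a short context and few at the maximum length."""
    for bound in sorted(buckets):
        idxs = buckets[bound]
        for i in range(0, len(idxs), batch_size):
            yield bound, idxs[i:i + batch_size]
```

The speedup comes from avoiding padding every batch to 128K: attention and activation-memory costs scale with the batch's bucket length, not the global maximum.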
Why it matters?
This work is important because it pushes the boundaries of what AI can do in mathematics. By providing a richer dataset and enabling tool use, the researchers achieved state-of-the-art results, including 100% majority-vote accuracy on the AIME 2024 and 2025 competitions when the model can run Python code. This means we're getting closer to AI systems that can genuinely reason through complex math rather than memorize patterns, with implications for education, research, and many other fields.
Abstract
High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2-3x without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.
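The maj@16 metric in the abstract means sampling 16 solutions per problem and scoring the majority final answer. A minimal sketch of maj@k scoring, assuming final answers are already extracted as strings and ties break by first occurrence (a detail the paper doesn't specify here):

```python
from collections import Counter


def maj_at_k(sampled_answers):
    """Majority vote over k sampled final answers for one problem.

    Counter.most_common breaks ties by insertion order, so the
    first-seen answer wins a tie.
    """
    return Counter(sampled_answers).most_common(1)[0][0]


def maj_at_k_accuracy(per_problem_samples, gold_answers):
    """Fraction of problems whose majority-voted answer matches the gold answer."""
    hits = sum(maj_at_k(samples) == gold
               for samples, gold in zip(per_problem_samples, gold_answers))
    return hits / len(gold_answers)
```

Under this metric, 100% maj@16 on AIME means that for every problem, the most common answer among 16 samples was correct, even if individual samples disagreed.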