MegaMath: Pushing the Limits of Open Math Corpora
Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, Eric P. Xing
2025-04-07
Summary
This paper introduces MegaMath, a large open dataset of math-focused web text, code, and synthetic data built for pre-training large language models (LLMs) on mathematical reasoning.
What's the problem?
Mathematical reasoning is a key benchmark for LLMs, but the research community lacks an open, large-scale, high-quality corpus suited to math-centric pre-training, so most strong math models rely on closed data.
What's the solution?
The authors curated MegaMath from three sources: re-extracting math documents from Common Crawl with math-oriented HTML processing, fastText-based filtering, and deduplication; recalling high-quality math-related code from the Stack-V2 corpus; and synthesizing QA-style text, math code, and interleaved text-code data. Together these yield 371B tokens.
Why does it matter?
An open corpus of this size and quality lets anyone pre-train or study math-capable LLMs, narrowing the gap with proprietary datasets and supporting further research on mathematical reasoning.
Abstract
Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: we re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication, all aimed at acquiring higher-quality data from the Internet. (2) Recalling math-related code data: we identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: we synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens, the largest quantity and top quality among existing open math pre-training datasets.
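The filtering-and-deduplication step in the web pipeline can be sketched as below. This is a toy illustration, not the paper's implementation: MegaMath uses a trained fastText classifier and large-scale deduplication, whereas the `math_score` heuristic and exact-hash dedup here are hypothetical stand-ins chosen to keep the example self-contained.

```python
import hashlib
import re

def math_score(text: str) -> float:
    """Toy stand-in for a fastText math classifier: the fraction of
    tokens that look math-related (keywords or LaTeX-style commands)."""
    math_pattern = re.compile(
        r"\\[a-z]+|\$|\b(theorem|proof|equation|integral|matrix)\b", re.I
    )
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if math_pattern.search(t))
    return hits / len(tokens)

def filter_and_dedup(docs, threshold=0.1):
    """Keep documents scoring above the threshold, dropping exact
    duplicates by content hash (real pipelines typically use
    near-deduplication such as MinHash instead)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:  # exact duplicate of an earlier document
            continue
        seen.add(digest)
        if math_score(doc) >= threshold:
            kept.append(doc)
    return kept

docs = [
    "We prove the theorem using the integral \\int_0^1 x dx.",
    "Celebrity gossip and sports news of the day.",
    "We prove the theorem using the integral \\int_0^1 x dx.",  # duplicate
]
print(filter_and_dedup(docs))  # only the first math document survives
```

The same two-stage shape (score-then-keep, hash-then-skip) scales to a crawl-sized corpus by swapping in a trained classifier and a near-duplicate index.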