Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

Liang Zeng, Liangjun Zhong, Liang Zhao, Tianwen Wei, Liu Yang, Jujie He, Cheng Cheng, Rui Hu, Yang Liu, Shuicheng Yan, Han Fang, Yahui Zhou

2024-07-13

Summary

This paper discusses the Skywork-Math model series, which aims to improve the mathematical reasoning abilities of large language models (LLMs) by using a large dataset specifically designed for math problems. It highlights how increasing the amount of training data can enhance a model's performance.

What's the problem?

Many existing LLMs struggle with mathematical reasoning due to limited training data. This can lead to poor performance on math-related tasks, making it difficult for these models to understand and solve complex problems accurately.

What's the solution?

The authors introduce the Skywork-Math model series: common 7B LLMs fine-tuned on a new dataset called Skywork-MathQA, which contains 2.5 million math problem instances. The dataset is built with a two-stage synthesis pipeline that combines three augmentation methods with a diverse seed problem set, giving the models diverse, high-quality training data across difficulty levels. Using only supervised fine-tuning (SFT) data, Skywork-Math 7B reaches 51.2% accuracy on the competition-level MATH benchmark and 83.9% on GSM8K, outperforming an early version of GPT-4 on MATH.
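To make the "augment seed problems, then fine-tune" idea concrete, here is a minimal sketch of turning a seed set of math QA pairs into SFT training examples. All names, the prompt format, and the placeholder augmentation are illustrative assumptions, not the paper's actual pipeline (which uses LLM-based augmentation at far larger scale).

```python
# Sketch: expand a seed set of (problem, solution) pairs into
# prompt/completion examples for supervised fine-tuning (SFT).

def rephrase(problem: str) -> str:
    # Placeholder for one augmentation method (e.g. LLM paraphrasing);
    # tagging the text keeps this sketch self-contained and runnable.
    return f"Restated: {problem}"

def to_sft_example(problem: str, solution: str) -> dict:
    # Standard prompt/completion pair consumed by an SFT trainer.
    return {
        "prompt": f"Problem: {problem}\nSolution:",
        "completion": f" {solution}",
    }

def build_dataset(seeds: list[tuple[str, str]]) -> list[dict]:
    examples = []
    for problem, solution in seeds:
        examples.append(to_sft_example(problem, solution))
        # Each augmentation method multiplies the seed set; the paper
        # combines three such methods to reach 2.5M instances.
        examples.append(to_sft_example(rephrase(problem), solution))
    return examples

seeds = [("What is 2 + 3?", "2 + 3 = 5.")]
dataset = build_dataset(seeds)
print(len(dataset))  # two examples per seed: original + one augmentation
```

The point of the sketch is the multiplicative structure: with a fixed seed set, each additional augmentation method scales the dataset, which is how the authors push data quantity while controlling quality through the seed problems.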

Why it matters?

This research is significant because it demonstrates that increasing the amount of relevant training data can greatly improve the capabilities of LLMs in mathematical reasoning. By providing insights into how to enhance these models, the findings can help advance AI applications in education, research, and various industries where math skills are essential.

Abstract

In this paper, we investigate the underlying factors that potentially enhance the mathematical reasoning capabilities of large language models (LLMs). We argue that the data scaling law for math reasoning capabilities in modern LLMs is far from being saturated, highlighting how the model's quality improves with increases in data quantity. To support this claim, we introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B LLMs using our proposed 2.5M-instance Skywork-MathQA dataset. Skywork-Math 7B has achieved impressive accuracies of 51.2% on the competition-level MATH benchmark and 83.9% on the GSM8K benchmark using only SFT data, outperforming an early version of GPT-4 on MATH. The superior performance of Skywork-Math models stems from our novel two-stage data synthesis and model SFT pipelines, which include three different augmentation methods and a diverse seed problem set, ensuring both the quantity and quality of the Skywork-MathQA dataset across varying difficulty levels. Most importantly, we provide several practical takeaways to enhance math reasoning abilities in LLMs for both research and industry applications.