
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Yuyang Ding, Xinyu Shi, Xiaobo Liang, Juntao Li, Qiaoming Zhu, Min Zhang

2024-10-25


Summary

This paper introduces ScaleQuest, a method for building high-quality question-and-answer datasets that improve the reasoning abilities of large language models (LLMs) by generating questions entirely from scratch.

What's the problem?

Large language models need large amounts of high-quality data to improve their reasoning skills, but such data is hard to find or create. Existing methods often rely on seed questions or knowledge bases, which can limit both the quality and the quantity of the training data. Additionally, the open-source community lacks effective and affordable ways to synthesize data at this scale.

What's the solution?

The authors introduced ScaleQuest, a method that uses small (around 7B-parameter) open-source models to generate questions without any initial seed data. This approach let them build a dataset of 1 million problem-solution pairs for mathematical reasoning. Fine-tuning various LLMs on this dataset produced gains of 29.2% to 46.4% on the MATH benchmark compared to training on existing open-source datasets.
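To make the from-scratch idea concrete, here is a minimal two-stage sketch: one small open-source model drafts a question with no seed problem in the prompt, and a second model writes a step-by-step solution. The model names, prompts, and decoding settings below are illustrative assumptions; the actual ScaleQuest pipeline additionally fine-tunes the question generator and filters the generated data.

```python
# Minimal sketch of from-scratch question synthesis followed by solution
# generation, in the spirit of ScaleQuest. Model identifiers and prompts
# are placeholders, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

QUESTION_MODEL = "deepseek-ai/deepseek-math-7b-instruct"  # assumed stand-in
SOLUTION_MODEL = "Qwen/Qwen2-Math-7B-Instruct"            # assumed stand-in

def generate(model_name: str, prompt: str, max_new_tokens: int = 512) -> str:
    # Reloading the model per call keeps the sketch short; a real pipeline
    # would load each model once and batch thousands of prompts.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                            do_sample=True, temperature=1.0)
    # Drop the prompt tokens, keep only the newly generated text.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Stage 1: sample a brand-new question -- no seed problem in the prompt.
question = generate(QUESTION_MODEL,
                    "Write a challenging, self-contained math problem. "
                    "State only the problem.\n\nProblem:")

# Stage 2: have a solver model produce a step-by-step solution.
solution = generate(SOLUTION_MODEL,
                    "Solve the following problem step by step.\n\n"
                    f"Problem: {question}\n\nSolution:")

print(question)
print(solution)
```

Repeating the two stages at scale, then filtering the pairs for quality, is what yields a large synthetic training set without any human-written seed questions.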

Why it matters?

This research is important because it provides a scalable and efficient way to enhance the training of language models, making them better at reasoning tasks. By generating high-quality datasets automatically, it opens up new possibilities for improving AI systems in education, research, and other fields that rely on accurate reasoning.

Abstract

The availability of high-quality data is one of the most important factors in improving the reasoning capability of LLMs. Existing works have demonstrated the effectiveness of creating more instruction data from seed questions or knowledge bases. Recent research indicates that continually scaling up data synthesis from strong models (e.g., GPT-4) can further elicit reasoning performance. Though promising, the open-source community still lacks high-quality data at scale and scalable data synthesis methods with affordable costs. To address this, we introduce ScaleQuest, a scalable and novel data synthesis method that utilizes "small-size" (e.g., 7B) open-source models to generate questions from scratch, without the need for seed data or complex augmentation constraints. With the efficient ScaleQuest, we automatically constructed a mathematical reasoning dataset consisting of 1 million problem-solution pairs, which are more effective than existing open-source datasets. It can universally increase the performance of mainstream open-source models (i.e., Mistral, Llama3, DeepSeekMath, and Qwen2-Math), achieving 29.2% to 46.4% gains on MATH. Notably, simply fine-tuning the Qwen2-Math-7B-Base model with our dataset can even surpass Qwen2-Math-7B-Instruct, a strong model aligned on closed-source data, as well as proprietary models such as GPT-4-Turbo and Claude-3.5 Sonnet.
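For readers curious what "simply fine-tuning" a base model on the synthesized pairs looks like, below is a minimal supervised fine-tuning sketch with a standard causal-LM objective. The base-model identifier, data format, and hyperparameters are assumptions for illustration, not the paper's training recipe.

```python
# Minimal supervised fine-tuning sketch over problem-solution pairs.
# Toy scale only: training a 7B model for real requires a distributed
# setup, padding/prompt masking in the labels, and a learning-rate schedule.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2-Math-7B"  # assumed identifier for the base model

pairs = [{"problem": "What is 2 + 2?", "solution": "2 + 2 = 4."}]  # toy data

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def collate(batch):
    # Concatenate each problem and solution into one training sequence.
    texts = [f"Problem: {ex['problem']}\nSolution: {ex['solution']}"
             for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=1024)
    enc["labels"] = enc["input_ids"].clone()  # next-token prediction targets
    return enc

loader = DataLoader(pairs, batch_size=1, shuffle=True, collate_fn=collate)

model.train()
for batch in loader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    loss = model(**batch).loss  # cross-entropy over the token sequence
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key point of the paper is not the fine-tuning procedure, which is standard, but that a sufficiently large and high-quality synthetic dataset makes this ordinary recipe competitive with strong instruction-tuned and proprietary models.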