Training and Evaluating Language Models with Template-based Data Generation

Yifan Zhang

2024-11-28

Summary

This paper introduces Template-based Data Generation (TDG), a method for automatically creating a large dataset of math problems so that language models can be better trained and evaluated on mathematical reasoning.

What's the problem?

Large language models (LLMs) such as GPT-3 are strong at understanding and generating text, but they struggle with complex reasoning tasks, especially in math. This is partly because there are few large, high-quality datasets designed for training these models on math problems, which limits how well they can learn to reason.

What's the solution?

The authors introduce TDG, which uses GPT-4 to automatically write parameterized meta-templates for math problems; filling in a template's parameters with sampled values then produces many distinct problems. Using this approach, they built a dataset called TemplateMath Part I: TemplateGSM, containing over 7 million synthetic grade school math problems, each paired with both a code-based and a natural language solution. Because new problems can be generated on demand from the templates, the dataset is effectively unlimited in size while staying diverse and high quality. A minimal sketch of the idea follows.
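To make the template idea concrete, here is a minimal sketch of what one parameterized problem template might look like. This is an illustration under assumptions, not the paper's actual meta-template format: in TDG the templates themselves are generated by GPT-4 and are far richer, and the field names below (`problem`, `solution_code`, `solution_text`, `answer`) are hypothetical.

```python
import random

def generate_problem(rng: random.Random) -> dict:
    # Sample parameters within ranges chosen so the answer is always a
    # non-negative integer -- one simple way to guarantee a valid solution.
    apples = rng.randint(5, 50)
    eaten = rng.randint(1, apples)

    problem = (
        f"Sam has {apples} apples. He eats {eaten} of them. "
        "How many apples does Sam have left?"
    )
    # Code-based solution: an executable program whose output is the answer.
    solution_code = f"print({apples} - {eaten})"
    # Natural language solution for the same problem instance.
    solution_text = (
        f"Sam starts with {apples} apples and eats {eaten}, "
        f"so {apples} - {eaten} = {apples - eaten} apples remain."
    )
    return {
        "problem": problem,
        "solution_code": solution_code,
        "solution_text": solution_text,
        "answer": apples - eaten,
    }

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(3):
        record = generate_problem(rng)
        print(record["problem"], "->", record["answer"])
```

Each call samples fresh parameter values, so a single template can yield arbitrarily many distinct problem-solution pairs; scaling this up across millions of GPT-4-generated templates is what makes the dataset effectively unlimited.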

Why it matters?

This research is important because it addresses the lack of good datasets for training language models on math reasoning. By providing a large and varied collection of math problems, it helps improve the ability of AI systems to understand and solve complex mathematical questions, which can be beneficial in education and other fields that rely on mathematical reasoning.

Abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, these models often struggle with tasks requiring complex reasoning, particularly in mathematical problem-solving, due in part to the scarcity of large-scale, high-quality, domain-specific datasets necessary for training sophisticated reasoning abilities. To address this limitation, we introduce Template-based Data Generation (TDG), a novel approach that leverages LLMs (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a vast array of high-quality problems and solutions. Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset comprising over 7 million synthetically generated grade school math problems--each accompanied by code-based and natural language solutions--with the potential to generate an effectively unlimited number more. This dataset alleviates the scarcity of large-scale mathematical datasets and serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs in mathematical reasoning. Our method not only enables the generation of virtually infinite data but also elevates data augmentation to a new level by using GPT-4 for meta-template generation, ensuring diverse and high-quality problem structures. The TemplateMath Part I: TemplateGSM dataset is publicly available at https://huggingface.co/datasets/math-ai/TemplateGSM. The code is available at https://github.com/iiis-ai/TemplateMath.
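Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the `datasets` library. The snippet below is a hedged usage sketch, not from the paper: the default configuration and a "train" split are assumptions, and the record schema is not assumed, so the code inspects the field names of the first record instead of hard-coding them.

```python
from datasets import load_dataset

# Stream the dataset rather than downloading all 7M+ records up front.
# Assumes the repo's default config exposes a "train" split.
ds = load_dataset("math-ai/TemplateGSM", split="train", streaming=True)

first = next(iter(ds))
print(sorted(first.keys()))  # discover the actual field names
```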