Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
2024-09-13

Summary
This paper presents Source2Synth, a new method for creating synthetic training data for Large Language Models (LLMs) that is grounded in real-world data sources and does not require expensive human annotation.
What's the problem?
LLMs often struggle with tasks that involve structured data, complex reasoning, or tool usage. Current approaches to teaching them these skills usually rely on human-annotated examples, which are costly and time-consuming to produce, and this limits how effectively LLMs can be trained for such challenging scenarios.
What's the solution?
Source2Synth generates synthetic data points from real data sources, attaching intermediate reasoning steps grounded in those sources to improve the quality of the generated data. It then curates the result by discarding low-quality examples based on their answerability, so that only useful data is used for training. The method was tested in two areas, multi-hop question answering (MHQA) and tabular question answering (TQA), where it improved performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to fine-tuned baselines.
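To make the generate-then-curate idea concrete, here is a minimal Python sketch of such a pipeline. It is not the authors' implementation: the call_llm helper, the prompt wording, and the answerability check are illustrative assumptions standing in for whatever model and prompts one actually uses.

# Minimal sketch of a Source2Synth-style generate-then-curate loop.
# call_llm is a hypothetical placeholder for any LLM completion API;
# prompts, parsing format, and the answerability test are illustrative.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. a hosted chat-completion endpoint)."""
    raise NotImplementedError

def generate_example(source_passage: str) -> dict:
    """Seed generation with a real passage and ask for a question,
    intermediate reasoning steps, and an answer grounded in it."""
    prompt = (
        "Passage:\n" + source_passage + "\n\n"
        "Write a question answerable only from this passage, "
        "the step-by-step reasoning, and the final answer.\n"
        "Format: QUESTION: ... REASONING: ... ANSWER: ..."
    )
    raw = call_llm(prompt)
    question, rest = raw.split("REASONING:", 1)
    reasoning, answer = rest.split("ANSWER:", 1)
    return {
        "source": source_passage,
        "question": question.replace("QUESTION:", "").strip(),
        "reasoning": reasoning.strip(),
        "answer": answer.strip(),
    }

def is_answerable(example: dict) -> bool:
    """Curation step: keep an example only if the model, given the source,
    reproduces the stored answer (a simple proxy for answerability)."""
    prompt = (
        "Passage:\n" + example["source"] + "\n\n"
        "Question: " + example["question"] + "\nAnswer briefly:"
    )
    predicted = call_llm(prompt).strip().lower()
    return example["answer"].lower() in predicted

def build_dataset(source_passages: list[str]) -> list[dict]:
    """Generate one candidate per passage, then filter out low-quality ones."""
    candidates = [generate_example(p) for p in source_passages]
    return [ex for ex in candidates if is_answerable(ex)]

The surviving examples would then be used as fine-tuning data; the key design choice, as in the paper, is that both generation and filtering are driven by the model itself rather than by human annotators.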
Why it matters?
This research matters because it offers a more efficient way to create high-quality training data for AI models, enabling them to learn new skills without relying heavily on human annotation. By improving how LLMs handle complex tasks such as reasoning over structured data, it can support progress in AI applications across a range of fields.
Abstract
Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.