Evaluating Language Models as Synthetic Data Generators
Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
2024-12-06
Summary
This paper introduces AgoraBench, a new benchmark designed to evaluate how well language models (LMs) generate synthetic data, a resource increasingly used to train AI systems.
What's the problem?
As language models are increasingly used to create synthetic data for training, it is important to know which models generate the highest-quality data. However, prior work has not compared different LMs as data generators under a single, consistent evaluation setup, making it hard to understand their relative strengths and weaknesses.
What's the solution?
To address this issue, the researchers developed AgoraBench, which provides standardized settings and metrics for evaluating the data generation abilities of various LMs. They generated 1.26 million training examples using six different LMs and trained 99 student models on this data. Their findings showed that different LMs excel in different areas: for example, GPT-4o is strongest at creating new problems, while Claude-3.5-Sonnet is better at improving existing ones. They also found that a model's ability to generate useful data doesn't always match its own problem-solving skill; instead, intrinsic properties of the data, such as response quality, perplexity, and instruction difficulty, are better indicators of how effective it will be for training.
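The comparison described above, training a student model on each generator's synthetic data and ranking generators by the gain they produce, can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation: the generator names and scores are invented, and the simple relative-gain metric stands in for AgoraBench's own metrics.

```python
# Hypothetical sketch of an AgoraBench-style comparison.
# Scores here would come from fine-tuning a student model on each
# generator's synthetic data and evaluating it on a held-out benchmark.

def relative_gain(student_score: float, baseline_score: float) -> float:
    """Improvement of a trained student over the untrained baseline,
    as a fraction of the baseline score."""
    return (student_score - baseline_score) / baseline_score

def rank_generators(student_scores: dict[str, float],
                    baseline_score: float) -> list[tuple[str, float]]:
    """Rank data-generator LMs by the gain their synthetic data
    produced in the student model, highest gain first."""
    gains = {name: relative_gain(score, baseline_score)
             for name, score in student_scores.items()}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

# Toy numbers, purely illustrative (not taken from the paper):
baseline = 40.0  # student accuracy before any synthetic-data training
student_scores = {
    "generator_A": 52.0,  # student trained on A's synthetic data
    "generator_B": 47.0,
    "generator_C": 44.0,
}
ranking = rank_generators(student_scores, baseline)
```

Ranking by student gain rather than by the generator's own benchmark score reflects the paper's key observation: a model's data generation ability doesn't necessarily track its problem-solving ability.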
Why it matters?
This research is significant because it helps improve the understanding of how effective different language models are at generating synthetic data. By providing a systematic way to evaluate these models, AgoraBench can guide developers in selecting the best models for specific tasks, ultimately enhancing the performance of AI systems that rely on high-quality training data.
Abstract
Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality, including response quality, perplexity, and instruction difficulty, collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.