AgentInstruct: Toward Generative Teaching with Agentic Flows

Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah

2024-07-10

Summary

This paper introduces AgentInstruct, a new framework designed to automatically create large amounts of diverse, high-quality synthetic data for improving language models. The approach aims to teach models new skills using data generated from raw sources such as documents and code files.

What's the problem?

The main problem is that while synthetic data can be very useful for training language models, it often varies in quality and diversity. Additionally, creating high-quality synthetic data usually requires a lot of human effort to ensure that it is useful and accurate. There are also concerns that using synthetic data can lead to 'model collapse,' where models become less effective because they only imitate other models instead of learning new skills.

What's the solution?

To address these issues, the authors developed AgentInstruct, which uses an agentic framework to automatically generate both prompts and responses from raw data sources like text documents and code files. This allows for the creation of a large dataset of 25 million prompt-response pairs that can teach language models various skills, such as text editing, creative writing, coding, and reading comprehension. The dataset was used to post-train Mistral-7b, producing a model called Orca-3 that shows significant improvements over Mistral-7b-Instruct (built on the same base model) across benchmarks including AGIEval, MMLU, GSM8K, BBH, and AlpacaEval.
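The pipeline described above can be illustrated with a toy sketch. The three stages here (content transformation, instruction generation, refinement) mirror the paper's high-level flow, but the function names and the string-manipulation logic are purely illustrative stand-ins; a real flow would invoke an LLM agent at each step.

```python
# Hypothetical sketch of an agentic data-generation flow.
# Each "agent" below is a toy stand-in for an LLM call; only the
# three-stage structure is taken from the paper's description.

def content_transformation_agent(seed_text: str) -> str:
    """Rewrite a raw seed document into an intermediate form that
    instructions can target (e.g. a passage for comprehension tasks)."""
    return f"Passage: {seed_text}"

def instruction_generation_agent(transformed: str) -> dict:
    """Create a (prompt, response) pair from the transformed content."""
    prompt = (
        "Read the following and summarize it in one sentence.\n"
        + transformed
    )
    # Toy "response": echo the passage body; a real agent would
    # generate an actual answer with a capable model.
    response = transformed.split(":", 1)[1].strip()
    return {"prompt": prompt, "response": response}

def refinement_agent(pair: dict) -> dict:
    """Tighten or harden the generated instruction (quality/difficulty)."""
    pair["prompt"] += "\nUse at most 15 words."
    return pair

def agentic_flow(seed_text: str) -> dict:
    """Chain the three stages: seed -> transform -> generate -> refine."""
    return refinement_agent(
        instruction_generation_agent(
            content_transformation_agent(seed_text)
        )
    )

pair = agentic_flow("Synthetic data can accelerate language-model training.")
```

Running the flow over millions of raw seeds, with many specialized agents per skill, is what yields a large and diverse post-training dataset rather than one that merely imitates a single teacher model.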

Why it matters?

This research is important because it shows how we can effectively use synthetic data to enhance the training of language models without requiring extensive human input. By automating the generation of high-quality training data, AgentInstruct helps improve the capabilities of AI systems, making them more effective in understanding and generating human-like text. This advancement could lead to better applications in areas like customer service, content creation, and education.

Abstract

Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. Despite several successful use cases, researchers have also raised concerns around model collapse and the drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically creating data by powerful models to teach a new skill or behavior to another model; we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model Orca-3 to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH and 45% improvement on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.