Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
2024-06-21

Summary
This paper introduces AutoIF, a new method for automatically creating high-quality training data that helps large language models (LLMs) follow instructions better.
What's the problem?
Large language models are designed to understand and follow instructions given in natural language, but creating the training data they need is challenging. Traditional methods require extensive manual work from human annotators to write instructions and responses, which is time-consuming and does not scale. Additionally, when models are used to generate training data themselves, they make mistakes, producing unreliable data that can hinder a model's ability to perform complex tasks accurately.
What's the solution?
The researchers developed AutoIF, which automates the generation of instruction-following training data. The method involves three main steps: first, LLMs write their own instructions; second, they generate code that checks whether a response actually follows each instruction; and third, they create unit tests to verify that the checking code itself is correct. Any instruction whose checker cannot be verified this way is discarded (see the sketch below). AutoIF then applies execution feedback-based rejection sampling so that only high-quality data is used for training. The results showed significant improvements for models such as Qwen2 and LLaMA3 across several training algorithms.
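To make the verification step concrete, here is a minimal sketch of how a generated checker and its unit tests might be executed; it is an illustration under assumptions, not the authors' implementation (that is available at https://github.com/QwenLM/AutoIF). The hard-coded checker_code and test_cases stand in for artifacts an LLM would generate.

```python
# Minimal sketch of an AutoIF-style verification step (illustrative, not the paper's code).
# checker_code and test_cases stand in for LLM-generated artifacts.

checker_code = """
def check(response: str) -> bool:
    # Example instruction: "answer in exactly three words"
    return len(response.split()) == 3
"""

test_cases = [
    ("one two three", True),   # response that should satisfy the instruction
    ("too short", False),      # response that should violate it
]

def verify(checker_code: str, test_cases) -> bool:
    """Keep an instruction only if its generated checker runs and passes every unit test."""
    namespace = {}
    try:
        exec(checker_code, namespace)          # execute the model-written checker
        check = namespace["check"]
        return all(check(resp) == expected for resp, expected in test_cases)
    except Exception:
        return False                           # broken checker code -> discard the instruction

print(verify(checker_code, test_cases))  # True: this instruction would be kept
```

Instructions whose checkers fail to compile or disagree with their own unit tests are filtered out, which is what makes the resulting training data reliable without human annotation.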
Why it matters?
This research is important because it addresses a major challenge in improving the capabilities of LLMs. By automating the generation of reliable training data, AutoIF makes it easier to train models that can accurately follow complex instructions. This advancement could lead to more effective AI applications in fields like customer service, education, and any other area where understanding and executing instructions is crucial.
Abstract
One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.
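As an illustration of how execution feedback can drive rejection sampling, the sketch below samples several candidate responses per instruction and keeps only those the verified checker accepts; accepted responses can serve as SFT targets, and accepted/rejected pairs as preference data for DPO-style training. The sample_responses and check callables are assumptions for illustration, not the paper's API.

```python
# Hypothetical sketch of execution feedback-based rejection sampling (not the authors' code).
# sample_responses(instruction, k) is assumed to draw k candidate responses from an LLM;
# check(response) is the previously verified checker for that instruction.

def rejection_sample(instruction, sample_responses, check, k=8):
    """Split sampled responses into accepted/rejected using execution feedback."""
    accepted, rejected = [], []
    for response in sample_responses(instruction, k):
        (accepted if check(response) else rejected).append(response)
    return accepted, rejected

def build_training_data(instruction, accepted, rejected):
    """Accepted responses become SFT examples; accepted/rejected pairs become preference pairs."""
    sft = [{"prompt": instruction, "response": r} for r in accepted]
    dpo = [{"prompt": instruction, "chosen": a, "rejected": b}
           for a in accepted for b in rejected]
    return sft, dpo
```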