IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li
2024-11-12

Summary
This paper introduces IOPO, an alignment method designed to improve how large language models (LLMs) follow complex instructions by optimizing over preference pairs for both the input instructions and the output responses.
What's the problem?
As LLMs are increasingly used for tasks that involve complex instructions, there is little data for evaluating how well they handle such instructions. In addition, existing alignment algorithms are not designed to strengthen complex instruction following, so models can misinterpret constraints and produce erroneous responses.
What's the solution?
To tackle these issues, the authors present TRACE, a benchmark with 120,000 training examples and 1,000 evaluation examples specifically focused on complex instruction following. They also propose Input-Output Preference Optimization (IOPO), which builds preference pairs over both the input instructions and the output responses, so a model learns not only which response is preferred but also which instruction a given response actually satisfies. Experiments show consistent improvements on both in-domain and out-of-domain datasets.
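To make the "input and output preference pairs" idea concrete, below is a minimal sketch of a DPO-style objective extended with input-side preference terms. It assumes a group (x1, y1) and (x2, y2) in which each response y_i satisfies its own instruction x_i, and combines output-preference terms (instruction fixed, responses compared) with input-preference terms (response fixed, instructions compared). The function names, the log-probability layout, and the simple averaging are illustrative assumptions, not the paper's exact IOPO formulation.

```python
import torch
import torch.nn.functional as F

def pref_term(logp_pi_w, logp_ref_w, logp_pi_l, logp_ref_l, beta):
    """DPO-style Bradley-Terry term:
    -log sigmoid(beta * (preferred log-ratio - dispreferred log-ratio))."""
    return -F.logsigmoid(beta * ((logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l)))

def iopo_sketch_loss(lp, lr, beta=0.1):
    """Illustrative dual input/output preference objective (not the paper's exact loss).

    lp[i][j] / lr[i][j] are policy / reference log-probs of response y_j given
    instruction x_i, for a group where y_i is the response satisfying x_i.
    """
    # Output-preference terms: with the instruction fixed, prefer its matching response.
    out_x1 = pref_term(lp[0][0], lr[0][0], lp[0][1], lr[0][1], beta)  # under x1: y1 > y2
    out_x2 = pref_term(lp[1][1], lr[1][1], lp[1][0], lr[1][0], beta)  # under x2: y2 > y1
    # Input-preference terms: with the response fixed, prefer its matching instruction.
    in_y1 = pref_term(lp[0][0], lr[0][0], lp[1][0], lr[1][0], beta)   # for y1: x1 > x2
    in_y2 = pref_term(lp[1][1], lr[1][1], lp[0][1], lr[0][1], beta)   # for y2: x2 > x1
    return (out_x1 + out_x2 + in_y1 + in_y2) / 4

# Toy usage with scalar log-probabilities (batching and sequence summation omitted).
lp = torch.tensor([[-1.0, -3.0], [-2.5, -0.8]])  # policy log p(y_j | x_i)
lr = torch.tensor([[-1.2, -2.8], [-2.4, -1.0]])  # reference log p(y_j | x_i)
print(iopo_sketch_loss(lp, lr).item())
```

The input-preference terms are what distinguish this sketch from plain DPO: they push the model to notice which instruction a response belongs to, rather than only which of two responses is better for a fixed instruction.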
Why it matters?
This research is important because it enhances the ability of LLMs to accurately follow complex instructions, making them more useful in real-world applications. By improving how these models learn from both inputs and outputs, IOPO can lead to better performance in various fields such as customer service, programming assistance, and automated content generation.
Abstract
In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications are built on LLMs, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a limited amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose IOPO (Input-Output Preference Optimization), an alignment method which takes both input and output preference pairs into consideration, where LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing 8.15% and 2.18% improvements on in-domain data and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.