
Aligning Large Language Models via Self-Steering Optimization

Hao Xiang, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun, Jingren Zhou, Junyang Lin

2024-10-23


Summary

This paper introduces Self-Steering Optimization (SSO), a method that trains large language models (LLMs) more effectively without requiring human-provided feedback.

What's the problem?

Aligning LLMs usually requires extensive manual effort to tell the model which responses are good or bad. This process is slow and expensive because it relies on humans to label preference data, which is rarely practical or scalable.

What's the solution?

The researchers developed SSO, an algorithm that automatically generates high-quality preference signals during iterative training, guided by predefined principles. The policy model produces its own chosen and rejected responses, so it learns what constitutes a good answer without human annotation. SSO was tested on two foundation models, Qwen2 and Llama3.1, and delivered significant performance improvements across six subjective and objective benchmarks.
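Below is a minimal sketch of the core idea, not the authors' implementation: the current policy generates both a chosen and a rejected response for the same prompt by following contrasting principle prompts, which keeps the pair on-policy while preserving a quality gap. The function names, principle texts, and placeholder generator are illustrative assumptions.

```python
# Sketch: on-policy preference-pair generation guided by contrasting principles.
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def generate_with_policy(prompt: str, principle: str) -> str:
    # Placeholder: in practice this would sample from the current policy model
    # (e.g., Qwen2 or Llama3.1) conditioned on the prompt plus the principle.
    return f"<response to '{prompt}' following principle: {principle}>"


def self_steered_pair(prompt: str, principle: str, anti_principle: str) -> PreferencePair:
    # Both responses come from the same (current) policy, so the pair stays
    # on-policy; the contrasting principles maintain a consistent quality gap
    # between the chosen and rejected responses.
    chosen = generate_with_policy(prompt, principle)
    rejected = generate_with_policy(prompt, anti_principle)
    return PreferencePair(prompt, chosen, rejected)


if __name__ == "__main__":
    pair = self_steered_pair(
        "Explain why the sky is blue.",
        principle="Answer accurately, concisely, and helpfully.",
        anti_principle="Answer vaguely and omit key details.",
    )
    print(pair.chosen)
    print(pair.rejected)
```

In a full pipeline, the resulting pairs would be collected each iteration and fed back into preference optimization of the same policy model.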

Why it matters?

This method is important because it makes the training of language models faster and less dependent on human effort, allowing for more efficient development of AI systems. It paves the way for better automated alignment in AI, which can lead to more reliable and effective language models in various applications.

Abstract

Automated alignment develops alignment systems with minimal human intervention. The key to automated alignment lies in providing learnable and accurate preference signals for preference learning without human annotation. In this paper, we introduce Self-Steering Optimization (SSO), an algorithm that autonomously generates high-quality preference signals based on predefined principles during iterative training, eliminating the need for manual annotation. SSO maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses while keeping them both on-policy to suit the current policy model's learning capacity. SSO can benefit the online and offline training of the policy model, as well as enhance the training of reward models. We validate the effectiveness of SSO with two foundation models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy preference signals throughout iterative training. Without any manual annotation or external models, SSO leads to significant performance improvements across six subjective or objective benchmarks. Moreover, the preference data generated by SSO significantly enhances the performance of the reward model on RewardBench. Our work presents a scalable approach to preference optimization, paving the way for more efficient and effective automated alignment.
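For context, here is a hedged sketch of a generic DPO-style preference objective that self-generated chosen/rejected pairs could feed into. This is a standard formulation, not the paper's specific loss; the `beta` value and the toy tensors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Generic direct-preference-optimization loss: push the policy to prefer
    # the chosen response over the rejected one, measured relative to a frozen
    # reference model. Self-generated pairs would supply these log-probabilities.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy usage with random log-probabilities (illustrative only).
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps).item())
```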