Direct Preference Optimization Using Sparse Feature-Level Constraints
Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
2024-11-14

Summary
This paper presents a new method called Feature-level constrained Preference Optimization (FPO) that aligns large language models (LLMs) with human preferences while keeping training more efficient and stable than existing approaches.
What's the problem?
Aligning LLMs with what humans want is difficult. Current methods, like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), can be computationally expensive and unstable during training, making it hard to improve these models reliably.
What's the solution?
The authors introduced FPO, which uses pre-trained Sparse Autoencoders (SAEs) to identify a small set of important internal features and applies constraints at the feature level to keep training stable. This allows for better alignment without the heavy computational costs of previous techniques: FPO achieved a 5.08% absolute improvement in win rate over state-of-the-art baselines while reducing the resources needed for training (see the sketch below).
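To make the SAE part concrete, here is a minimal Python sketch of how a pre-trained SAE encoder can turn dense hidden states into wide, sparse feature activations, keeping only the top-k active features per token. The class name SparseAutoencoderEncoder, the dimensions, and the top-k selection are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

class SparseAutoencoderEncoder(torch.nn.Module):
    # Hypothetical encoder half of a pre-trained SAE: maps a dense hidden
    # state (d_model) to a much wider feature vector (d_features), of which
    # only a few entries are active for any given token.
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_features) * 0.02)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_features))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # ReLU keeps only positively activated features, yielding a sparse vector.
        return F.relu(hidden @ self.W_enc + self.b_enc)

# Usage: encode token hidden states and keep the top-k activated features per token.
d_model, d_features, k = 2048, 16384, 32
sae = SparseAutoencoderEncoder(d_model, d_features)
hidden_states = torch.randn(4, 128, d_model)      # (batch, seq, d_model)
features = sae(hidden_states)                     # (batch, seq, d_features), mostly near zero
topk_vals, topk_idx = features.topk(k, dim=-1)    # the "important" features per token

Constraining only these sparse activations, rather than full output distributions, is what lets the method stay cheap: the reference activations can be computed once offline and reused throughout training.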
Why it matters?
This research is important because it offers a more efficient way to train LLMs, making them better at understanding and responding to human preferences. By improving how these models learn, we can enhance their performance in real-world applications, leading to smarter and more responsive AI systems.
Abstract
The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using the sparse features activated in a well-trained sparse autoencoder, and preserves the quality of the sequential KL divergence by using a feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.
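As a rough illustration of how a feature-level constraint can be combined with a preference objective, the Python sketch below adds a KL penalty between the policy's sparse SAE activations and precomputed offline reference activations to a DPO-style logistic term. The function name fpo_style_loss, the reference-free margin, the softmax normalization, and the weights beta and lam are assumptions made for illustration; the paper's exact objective differs in detail.

import torch
import torch.nn.functional as F

def fpo_style_loss(policy_chosen_logps,     # (batch,) log-prob of preferred responses under the policy
                   policy_rejected_logps,   # (batch,) log-prob of dispreferred responses under the policy
                   policy_features,         # (batch, seq, d_features) sparse SAE activations of the policy
                   reference_features,      # (batch, seq, d_features) SAE activations stored offline
                   beta=0.1, lam=0.05):
    # Preference term: a DPO-style logistic loss on the reward margin between
    # chosen and rejected responses (written reference-free here for brevity).
    margin = beta * (policy_chosen_logps - policy_rejected_logps)
    preference_loss = -F.logsigmoid(margin).mean()

    # Feature-level constraint: KL divergence between the policy's normalized
    # sparse activations and the precomputed offline reference activations.
    log_p = F.log_softmax(policy_features, dim=-1)
    q = F.softmax(reference_features, dim=-1)
    feature_kl = F.kl_div(log_p, q, reduction="batchmean")

    return preference_loss + lam * feature_kl

# Usage with dummy tensors of plausible shapes.
batch, seq, d_features = 4, 128, 16384
loss = fpo_style_loss(torch.randn(batch), torch.randn(batch),
                      torch.randn(batch, seq, d_features),
                      torch.randn(batch, seq, d_features))

Because reference_features come from a fixed, offline pass through the SAE, no reference model needs to be kept in memory during training, which is where the reported efficiency gains come from.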