Rethinking Diverse Human Preference Learning through Principal Component Analysis
Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
2025-02-19
Summary
This paper introduces Decomposed Reward Models (DRMs), a new way to understand and use human preferences in AI systems by breaking them down into simpler parts using a method called Principal Component Analysis (PCA).
What's the problem?
AI systems need to understand human preferences to work better and feel more personalized, but human preferences are complex and diverse. Current methods for capturing them require fine-grained annotated data, which is expensive and hard to collect, and they often fail to represent the full variety of what people want.
What's the solution?
The researchers created DRMs, which use PCA to analyze human preferences as vectors (like arrows pointing in different directions). By comparing preferred and rejected responses, they identified key 'directions' that represent different aspects of preferences, like safety or humor. These directions can be combined in various ways to match individual user needs without requiring extra training. This approach is scalable, interpretable, and adaptable to new users.
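The core idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embeddings are random placeholders standing in for a real model's hidden states, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 500, 64

# Placeholder embeddings for preferred and rejected responses
# (in the paper these would come from a language model).
chosen = rng.normal(size=(n_pairs, dim))
rejected = rng.normal(size=(n_pairs, dim))

# Each preference pair becomes a difference vector: an "arrow"
# pointing from the rejected response toward the preferred one.
diffs = chosen - rejected
diffs -= diffs.mean(axis=0)  # center before PCA

# PCA via the eigendecomposition of the covariance matrix.
cov = diffs.T @ diffs / (n_pairs - 1)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # sort descending by variance
basis = eigvecs[:, order]                    # columns = orthogonal preference directions

# Each column can act as a linear "decomposed reward":
# reward_k(response) = basis[:, k] @ embedding(response).
top_direction = basis[:, 0]
print(top_direction.shape)  # (64,)
```

The orthogonality of the PCA basis is what makes the decomposed rewards distinct from one another: each direction captures variance in the preference data that the others do not.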
Why it matters?
This matters because DRMs make it easier and cheaper to align AI systems with what people actually want. They allow for more personalized and flexible AI interactions while providing clear explanations of how the AI understands preferences. This could improve the way AI systems are used in areas like customer service, education, and entertainment.
Abstract
Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.
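The abstract's claim that decomposed rewards "can be flexibly combined to align with different user needs" without retraining can be illustrated as follows. This is a hedged sketch under assumptions: the basis, embeddings, and the agreement-based weighting rule are placeholders for illustration, not the paper's exact adaptation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, k = 64, 8

# Stand-in for k orthonormal preference directions found by PCA.
basis = np.linalg.qr(rng.normal(size=(dim, k)))[0]

# A handful of preference pairs labeled by a new user
# (random placeholders for real response embeddings).
chosen = rng.normal(size=(20, dim))
rejected = rng.normal(size=(20, dim))

# Per-direction reward gap on each of the user's pairs.
gaps = (chosen - rejected) @ basis  # shape (20, k)

# One simple weighting rule (an assumption, not the paper's):
# weight each direction by how often it agrees with the user,
# then normalize so the weights form a convex combination.
weights = (gaps > 0).mean(axis=0)
weights /= weights.sum()

def user_reward(embedding):
    """Score a response embedding with the user-weighted mixture."""
    return (embedding @ basis) @ weights
```

No gradient updates are involved: adapting to a new user only requires choosing mixing weights over the fixed reward directions, which is what makes the approach cheap and scalable.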