An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift
Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras
2026-01-12
Summary
This paper investigates how well language models that have been tuned to align with human preferences perform on tasks different from those they were originally trained on.
What's the problem?
When you train a language model to respond in ways humans prefer – focusing on qualities like helpfulness and safety – it often performs worse on new kinds of tasks or on data different from what it saw during training. This is called 'domain shift', and it is a significant problem because we want these models to be useful in many situations, not just the ones they were initially designed for. Previous research showed this degradation happens, but it was unclear which training methods were most affected or how best to fix it.
What's the solution?
The researchers systematically tested five different methods for training models to align with human preferences, then tried several strategies to help the resulting models adapt to new tasks. These adaptation strategies included further training on examples from the new task and a technique called 'pseudo-labeling', in which the model essentially creates its own training data. They focused on two tasks, text summarization and question answering, and measured how helpful the model's responses were in both the original and the new settings.
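The pseudo-labeling idea described above can be sketched as a simple loop: a source-aligned model generates candidate responses for unlabeled target-domain prompts, a scoring function ranks them, and the top-scoring candidate becomes the "pseudo-label" used for further fine-tuning. The details below (number of candidates, the scorer, the toy model) are illustrative assumptions, not the paper's exact recipe.

```python
def generate_candidates(model, prompt, n=4):
    """Stand-in for sampling n responses from the aligned model."""
    return [model(prompt, seed=i) for i in range(n)]

def build_pseudo_labels(model, score, prompts, n=4):
    """Keep the highest-scoring candidate per prompt as its pseudo-label.

    The resulting (prompt, response) pairs would then be used for
    supervised fine-tuning on the target domain.
    """
    dataset = []
    for prompt in prompts:
        candidates = generate_candidates(model, prompt, n)
        best = max(candidates, key=lambda r: score(prompt, r))
        dataset.append({"prompt": prompt, "response": best})
    return dataset

# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_model(prompt, seed=0):
    return f"{prompt} -> answer v{seed}"

def toy_score(prompt, response):
    # Pretend responses with higher version numbers are more helpful.
    return int(response.rsplit("v", 1)[-1])

if __name__ == "__main__":
    data = build_pseudo_labels(toy_model, toy_score, ["Summarize X", "Explain Y"])
    for example in data:
        print(example["response"])
```

In a real pipeline the toy model and scorer would be replaced by the aligned language model and, for example, a reward model; the key design choice is that no human labels from the target domain are required.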
Why it matters?
The findings show that some training methods avoid performance drops on new tasks better than others. Importantly, the researchers found that 'pseudo-labeling' can substantially improve a model's ability to generalize and remain helpful when the task changes, a crucial step toward building more reliable and versatile AI systems.
Abstract
Preference tuning aligns pretrained language models to human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference tuning degrades performance and reduces helpfulness when evaluated outside the training domain. However, the extent to which adaptation strategies mitigate this domain shift remains unexplored. We address this challenge by conducting a comprehensive and systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and various adaptation strategies from source to target, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation.
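To make "optimizing over explicit preference signals rather than likelihood alone" concrete, here is a minimal sketch of one widely used preference objective, Direct Preference Optimization (DPO). The abstract does not name the five objectives compared, so treating DPO as representative is an assumption; the per-pair loss rewards the policy for assigning more probability to the preferred response than a frozen reference model does.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen (preferred) and
    rejected responses under the policy being tuned and under a frozen
    reference model; beta scales the implicit reward margin.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))

# The loss shrinks as the policy shifts probability mass toward the
# preferred response relative to the reference model.
no_shift = dpo_loss(-5.0, -6.0, -5.0, -6.0)    # zero margin -> log 2
good_shift = dpo_loss(-4.0, -7.0, -5.0, -6.0)  # policy favors the chosen response
print(no_shift, good_shift)
```

Note the contrast with plain likelihood training: the loss depends only on the *relative* log-probability of the preferred versus rejected response, which is the "explicit preference signal" the abstract refers to.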