FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users
Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
2025-02-27
Summary
This paper presents a new way to make AI language models (LLMs) better at personalizing their responses to individual users. The researchers created a method called Few-Shot Preference Optimization (FSPO) that helps an AI quickly learn and adapt to a user's preferences from just a few examples.
What's the problem?
Personalizing AI responses for different users is important for things like virtual assistants and content recommendations. However, it's hard to get enough real data from users to train AI effectively for personalization. Also, current methods often struggle to adapt quickly to individual users' preferences.
What's the solution?
The researchers developed FSPO, which treats personalization as a learning problem where the AI figures out how to adapt quickly. They also created a large set of fake (synthetic) user preference data using other AI models. This synthetic data was carefully designed to be diverse but also consistent for each fake user. They tested FSPO on tasks like personalizing movie reviews, adjusting explanations based on education level, and answering general questions.
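To make the idea concrete, here is a minimal sketch of the two ingredients described above: prepending a user's few labeled preference examples to the prompt so the model can adapt in-context, and training with a pairwise preference loss. The function names and prompt template are illustrative assumptions, and the DPO-style loss shown is one common instantiation of preference optimization, not necessarily the exact objective used in the paper.

```python
import math

def build_fewshot_prompt(user_preferences, query):
    """Prepend a user's labeled preference examples to a new query.

    Each example shows which of two responses this user preferred,
    letting the model infer the user's tastes in-context.
    (Template is illustrative; the paper's exact format may differ.)
    """
    lines = []
    for ex in user_preferences:
        lines.append(f"Prompt: {ex['prompt']}")
        lines.append(f"Chosen: {ex['chosen']}")
        lines.append(f"Rejected: {ex['rejected']}")
    lines.append(f"Prompt: {query}")
    return "\n".join(lines)

def pairwise_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta=0.1):
    """DPO-style pairwise loss: -log sigmoid(beta * margin).

    The margin compares the policy's log-probability gap between the
    chosen and rejected responses against a frozen reference model's
    gap, so the loss decreases as the policy favors the user's
    preferred response more strongly than the reference does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Example: a (hypothetical) user who likes concise movie reviews.
prefs = [{"prompt": "Review the movie Inception.",
          "chosen": "Tight, clever heist film. 9/10.",
          "rejected": "Inception is a 2010 film directed by..."}]
prompt = build_fewshot_prompt(prefs, "Review the movie Arrival.")
```

In practice the log-probabilities would come from the language model being fine-tuned and a frozen reference copy; a positive margin (the policy prefers the chosen response more than the reference does) yields a lower loss than a zero margin.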
Why it matters?
This research matters because it could make AI assistants and other language AI much better at understanding and responding to individual users' needs and preferences. The method works well with both fake users and real people, which means it could be used to improve many AI applications without needing lots of personal data from real users. This could lead to more helpful and personalized AI interactions in everyday life, from better virtual assistants to more tailored online experiences.
Abstract
Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.