
One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

Hongru Cai, Yongqi Li, Tiezheng Yu, Fengbin Zhu, Wenjie Wang, Fuli Feng, Wenjie Li

2026-01-27


Summary

This paper focuses on making large language models, like the ones powering chatbots, better at giving responses people actually like. It goes beyond just general preferences and tries to tailor the model's responses to *each individual* user.

What's the problem?

The biggest challenge is getting enough information from each person to understand what they want. People don't want to give tons of feedback, and a model needs to quickly learn what a *new* user likes without needing a lot of examples from them. Simply collecting data from every user and fitting a model to it isn't practical: each person provides only a little feedback, and brand-new users show up with no data at all.

What's the solution?

The researchers propose a new technique called Meta Reward Modeling (MRM). Instead of trying to directly learn what each user prefers, MRM learns *how to learn* user preferences quickly. It imagines each user's preferences as a blend of basic preference 'building blocks' and then figures out the best starting point for those blends so they can be easily adjusted with just a little bit of feedback. They also added a way to make sure the model doesn't ignore users who are harder to understand, ensuring it works well for everyone.
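
To make the 'building blocks' idea concrete, here is a minimal sketch (not the authors' code) of how a user's reward could be expressed as a weighted blend of K base reward functions, with only the blend weights adapted from a few preference pairs. The function names and the Bradley-Terry-style pairwise loss are illustrative assumptions.

```python
import torch

# A minimal sketch (not the authors' code): a user's reward is a weighted
# blend of K base reward functions, and only the K blend weights are
# adapted from that user's handful of feedback examples.

def user_reward(base_scores, weights):
    """base_scores: (N, K) scores from K base reward models for N responses.
    weights: (K,) per-user mixing weights. Returns (N,) personalized rewards."""
    return base_scores @ torch.softmax(weights, dim=0)

def adapt_to_user(pos_scores, neg_scores, init_weights, steps=5, lr=0.1):
    """Few-shot adaptation: fit the blend weights on a few (preferred,
    rejected) response pairs with a Bradley-Terry-style pairwise loss."""
    w = init_weights.clone().requires_grad_(True)
    for _ in range(steps):
        margin = user_reward(pos_scores, w) - user_reward(neg_scores, w)
        loss = -torch.nn.functional.logsigmoid(margin).mean()
        (grad,) = torch.autograd.grad(loss, w)
        w = (w - lr * grad).detach().requires_grad_(True)
    return w.detach()
```

The key point of the method is that `init_weights` (the "starting point" for the blend) is itself meta-learned, so a handful of gradient steps is enough for a new user.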

Why it matters?

This work is important because it makes personalized AI more practical. If models can quickly adapt to your individual tastes with minimal input, they'll be much more useful and enjoyable to use. It's a step towards AI that truly understands and responds to *you*, not just the average person.

Abstract

Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta-optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
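
The abstract's MAML-style formulation can be sketched roughly as follows. This is a hedged illustration under assumptions, not the paper's implementation; in particular, the softmax-over-losses hardness weighting is only one plausible reading of the Robust Personalization Objective (RPO).

```python
import torch

def meta_step(meta_weights, user_batches, inner_lr=0.1, inner_steps=3):
    """One MAML-style meta-update sketch.
    meta_weights: (K,) tensor with requires_grad=True (the shared initialization).
    user_batches: list of (support_pos, support_neg, query_pos, query_neg),
    each an (N, K) tensor of base-reward scores for one sampled user."""
    per_user_losses = []
    for sp, sn, qp, qn in user_batches:
        w = meta_weights
        # Inner loop: adapt the blend weights on the user's support pairs.
        for _ in range(inner_steps):
            margin = sp @ torch.softmax(w, 0) - sn @ torch.softmax(w, 0)
            loss = -torch.nn.functional.logsigmoid(margin).mean()
            (grad,) = torch.autograd.grad(loss, w, create_graph=True)
            w = w - inner_lr * grad
        # Outer loss: how well the adapted weights rank the user's query pairs.
        q_margin = qp @ torch.softmax(w, 0) - qn @ torch.softmax(w, 0)
        per_user_losses.append(-torch.nn.functional.logsigmoid(q_margin).mean())
    losses = torch.stack(per_user_losses)
    # Assumed robustness weighting: up-weight hard-to-learn users
    # (softmax over their losses) before aggregating.
    return (torch.softmax(losses.detach(), 0) * losses).sum()
```

A training loop would repeatedly call `meta_step` on sampled user batches, call `.backward()` on the returned loss, and update `meta_weights` with an optimizer, so that the learned initialization supports fast few-shot adaptation to unseen users.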