
Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou

2025-10-22


Summary

This paper introduces a new method for making large language models, like those powering chatbots, better at responding in a way that specifically matches what each individual user wants.

What's the problem?

Currently, customizing these models to individual preferences is difficult. Simply retraining them on examples (supervised fine-tuning) quickly stops improving, and using human feedback to guide the model (reinforcement learning) can lead to the model 'gaming the system': producing long, overly enthusiastic responses that *seem* personalized but aren't actually helpful. The standard way of scoring responses, a single number from a reward model, is too simple and easily tricked.

What's the solution?

The researchers developed a system called Critique-Post-Edit. It has two main parts: first, a 'Personalized Generative Reward Model' gives detailed feedback on responses, scoring them along multiple dimensions rather than with a single number and explaining *why* a response is good or bad. Second, the language model revises its own answers based on this feedback, making targeted improvements instead of relearning from scratch. This process helps the model learn more efficiently and avoids the 'gaming' problem.
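The loop above can be sketched in a few lines. This is a minimal, illustrative toy, not the paper's implementation: the policy, the reward model, the score dimensions, and the revision threshold are all stand-ins chosen for clarity.

```python
# Toy sketch of one Critique-Post-Edit step: generate a draft, get
# multi-dimensional scores plus a textual critique from the GRM, then
# let the policy revise its own output guided by that critique.
# All function names and score dimensions are illustrative assumptions.

def policy_generate(profile, query):
    """Stand-in policy: produce an initial draft for a user query."""
    return f"Draft answer to '{query}' for a user who prefers {profile['style']}."

def grm_critique(profile, query, response):
    """Stand-in GRM: return per-dimension scores and a textual critique."""
    scores = {
        "faithfulness": 0.6,  # does it match the user's stated preferences?
        "helpfulness": 0.8,   # does it actually answer the query?
        "conciseness": 0.4,   # penalizes verbose, 'reward-hacked' answers
    }
    critique = "Too verbose; tighten the answer to match the user's concise style."
    return scores, critique

def policy_post_edit(profile, query, response, critique):
    """Stand-in revision: rewrite the draft guided by the critique."""
    return response + " [revised per critique: " + critique + "]"

def critique_post_edit_step(profile, query):
    draft = policy_generate(profile, query)
    scores, critique = grm_critique(profile, query, draft)
    # Revise only when the GRM flags a weak dimension; the revised
    # output, not just a scalar score, is what drives the policy update.
    if min(scores.values()) < 0.5:
        revised = policy_post_edit(profile, query, draft, critique)
    else:
        revised = draft
    reward = sum(scores.values()) / len(scores)  # aggregate for the RL update
    return revised, reward

profile = {"style": "concise, technical explanations"}
revised, reward = critique_post_edit_step(profile, "What is RLHF?")
print(round(reward, 2))  # → 0.6
```

The key design choice this sketch highlights: because the reward model emits text, the policy gets a concrete editing target rather than a single opaque number, which is harder to game with length or flattery.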

Why it matters?

This research shows a significant improvement in personalization, with the authors' models outperforming standard methods and, in the 14B case, even surpassing GPT-4.1. This means we're closer to having AI assistants that truly understand and respond to our individual needs and preferences, making them much more useful and enjoyable to interact with.

Abstract

Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.