Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li
2026-01-05
Summary
This paper investigates a problem in how we teach AI image generators to produce pictures people like, specifically when training on human feedback. While these generators are getting better at making images that score well on human-preference metrics, they often start producing very similar, unoriginal images.
What's the problem?
The issue is called 'Preference Mode Collapse,' and it happens when an AI focuses too much on getting a high score from the human-feedback signal. It's like a student only studying to ace a specific test – they might do well on that test, but they don't actually *learn* the material. In image generation, this means the AI finds a few styles or settings that humans consistently rate highly and then just keeps producing variations of those, losing diversity and originality. The researchers also built a new benchmark, DivGenBench, to measure how severe this collapse is.
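One common proxy for this kind of collapse is to generate many images for the same prompt and measure how similar their embeddings are: if every sample lands in nearly the same spot, the model has collapsed onto a few modes. The sketch below illustrates that proxy only; the paper does not spell out DivGenBench's actual metric here, so the function name and the cosine-distance formulation are assumptions for illustration.

```python
import numpy as np

def diversity_score(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance among image embeddings for one prompt.

    A score near 0 means all samples are nearly identical (collapse);
    higher scores mean more diverse generations. Illustrative proxy only,
    not DivGenBench's actual metric.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    # Average similarity over off-diagonal pairs, converted to a distance.
    mean_sim = (sims.sum() - n) / (n * (n - 1))
    return 1.0 - mean_sim

# Eight identical embeddings: fully collapsed, so diversity is zero.
collapsed = np.ones((8, 4))
print(round(diversity_score(collapsed), 3))  # 0.0
```

Mutually orthogonal embeddings score 1.0 under this measure, so the value can be read as "how far, on average, each pair of samples is from being the same image."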
What's the solution?
To fix this, the researchers developed a technique called D^2-Align. Think of it like subtly adjusting a teacher's grading criteria. Instead of letting a biased feedback system completely dictate what counts as 'good,' D^2-Align identifies the biases baked into that feedback and gently corrects for them. It does this by learning a correction to how the feedback is interpreted, *without* retraining the feedback system itself. This stops the AI from getting stuck in those narrow, high-scoring patterns and encourages it to explore a wider range of image possibilities.
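A toy sketch of the idea: suppose the frozen reward model over-rewards images along one direction in its embedding space (say, an "overexposed" style axis). Removing the image embedding's component along that direction before scoring yields a corrected reward. Everything below is an assumption for illustration – the cosine-similarity reward form, the function names, and especially how the bias direction is found (the paper learns its directional correction; that procedure is not reproduced here).

```python
import numpy as np

def corrected_reward(img_emb: np.ndarray,
                     txt_emb: np.ndarray,
                     bias_dir: np.ndarray,
                     strength: float = 1.0) -> float:
    """Score an image after removing a learned bias direction.

    `bias_dir` stands in for the directional correction learned in the
    frozen reward model's embedding space. The reward itself is modeled
    as a cosine similarity, as in CLIP-style preference models (an
    assumption about the reward model's form).
    """
    # Project out the component of the image embedding that lies along
    # the over-rewarded direction; the reward model stays untouched.
    unit = bias_dir / np.linalg.norm(bias_dir)
    debiased = img_emb - strength * (img_emb @ unit) * unit
    # Frozen reward: cosine similarity between image and text embeddings.
    return float(debiased @ txt_emb /
                 (np.linalg.norm(debiased) * np.linalg.norm(txt_emb)))
```

With `strength=0.0` this reduces to the raw reward; raising `strength` increasingly discounts images whose high score comes mainly from riding the biased direction, which is the mechanism that keeps the generator from collapsing onto that mode.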
Why it matters?
This research is important because it addresses a key limitation in current AI image generation. If we want AI to be truly creative and useful, it needs to be able to generate diverse and original content, not just endlessly repeat what it thinks we want. By preventing 'Preference Mode Collapse,' this work helps move us closer to AI that can produce genuinely innovative and interesting images.
Abstract
Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preferences via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking in which models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D^2-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D^2-Align achieves superior alignment with human preference.