MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan
2025-02-17
Summary
This paper introduces MM-RLHF, a new approach to making AI models that understand both text and images (called Multimodal Large Language Models, or MLLMs) work better by aligning them more closely with human preferences.
What's the problem?
Current MLLMs perform well on many tasks, but they are not well aligned with what humans actually want or prefer across different tasks. This is because most alignment research has focused on fixing specific issues, such as reducing hallucinations (made-up information), rather than on whether aligning models with human preferences can improve their capabilities overall.
What's the solution?
The researchers created MM-RLHF, a large dataset of 120,000 fine-grained, human-annotated preference comparisons of AI responses. They also introduced two new techniques: a Critique-Based Reward Model, which explains why an AI response is good or bad before scoring it, and Dynamic Reward Scaling, which weights each training example by the reward signal so the model learns more from the most informative comparisons. Evaluated across 10 dimensions and 27 benchmarks, their approach made the LLaVA-ov-7B model substantially better at conversation and much safer to use.
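The Dynamic Reward Scaling idea can be illustrated with a small sketch: a standard DPO-style preference loss where each comparison pair is reweighted by the reward model's score margin between the chosen and rejected responses. This is a minimal illustration under assumed details (the logistic DPO loss, a bounded tanh weighting, and the function and parameter names are all hypothetical), not the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dynamic_reward_scaling_loss(logratio_chosen, logratio_rejected,
                                reward_margins, beta=0.1, alpha=1.0):
    """DPO-style preference loss with per-pair weights from the reward model.

    logratio_*: per-pair log(pi_theta / pi_ref) for the chosen / rejected
        response (as in DPO).
    reward_margins: reward-model score gap r(chosen) - r(rejected) per pair.
    Illustrative sketch: the paper's exact weighting function may differ.
    """
    total, n = 0.0, len(reward_margins)
    for lc, lr, m in zip(logratio_chosen, logratio_rejected, reward_margins):
        # Standard DPO logistic loss on the scaled policy log-ratio gap.
        pair_loss = -math.log(sigmoid(beta * (lc - lr)))
        # Pairs the reward model is more confident about get a larger,
        # bounded weight, so high-quality comparisons drive the update.
        weight = 1.0 + alpha * math.tanh(m)
        total += weight * pair_loss
    return total / n
```

With this weighting, a pair whose reward margin is near zero contributes roughly the plain DPO loss, while a clearly separated pair contributes up to `1 + alpha` times as much gradient signal.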
Why it matters?
This matters because it could make AI systems that understand both text and images more helpful and reliable. The 19.5% improvement in conversational ability means the model communicates more naturally, while the 60% improvement in safety is a significant step toward making these systems trustworthy. By open-sourcing the dataset, reward model, and code, the researchers enable others to build on this work.
Abstract
Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.