Aligning Multimodal LLM with Human Preference: A Survey
Tao Yu, Yi-Fan Zhang, Chaoyou Fu, Junkang Wu, Jinda Lu, Kun Wang, Xingyu Lu, Yunhang Shen, Guibin Zhang, Dingjie Song, Yibo Yan, Tianlong Xu, Qingsong Wen, Zhang Zhang, Yan Huang, Liang Wang, Tieniu Tan
2025-03-19
Summary
This paper surveys methods for aligning Multimodal LLMs (MLLMs), AI models that process images, audio, and text, with human preferences.
What's the problem?
MLLMs are capable, but they still struggle with truthfulness, safety, and understanding what humans want. Getting them to consistently produce answers that people prefer remains difficult.
What's the solution?
The paper reviews the methods used to 'align' MLLMs with human preferences. It covers the application scenarios these methods target, how the alignment datasets are constructed, which benchmarks are used to evaluate the results, and potential future directions.
Why it matters?
This work matters because it helps researchers understand how to build AI models that are not only smart but also reliable, safe, and aligned with human values and expectations.
Abstract
Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.
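The three dataset factors named in item (2) of the abstract (data sources, model responses, and preference annotations) can be pictured as fields of a single record. The following is a minimal sketch under stated assumptions, not a format defined by the survey: the class name, field names, and example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PreferenceSample:
    """Hypothetical record for one multimodal preference example."""
    source: str            # data source: where the image and prompt came from
    image_path: str        # placeholder path to the visual input
    prompt: str            # question or instruction given to the model
    responses: List[str]   # model responses being compared
    preferred_index: int   # preference annotation: index of the preferred response

    def chosen_rejected_pairs(self) -> List[Tuple[str, str]]:
        """Pair the preferred response with every other response."""
        chosen = self.responses[self.preferred_index]
        return [
            (chosen, other)
            for i, other in enumerate(self.responses)
            if i != self.preferred_index
        ]


# Illustrative values only; none of these come from a real dataset.
sample = PreferenceSample(
    source="example-source",
    image_path="images/example.jpg",
    prompt="How many dogs are in the picture?",
    responses=["There are two dogs.", "There are five cats."],
    preferred_index=0,
)
pairs = sample.chosen_rejected_pairs()
```

Here `pairs` holds (chosen, rejected) tuples, the shape that pairwise preference training commonly consumes. Datasets that use scores or rankings instead of a single preferred index would need a different annotation field.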