
Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann, Christian Kerl, Rinu Boney, Yusu Qian, Zirui Wang, Afshin Dehghan, Yinfei Yang, Zhe Gan, Peter Grasch

2024-07-03


Summary

This paper examines how to improve the performance of Multimodal Large Language Models (MLLMs) by better aligning their responses with the content of images and other inputs, addressing issues like incorrect or inconsistent outputs.

What's the problem?

The main problem is that MLLMs often struggle to provide accurate responses when dealing with images. They can produce 'hallucinations,' which means they might say things that are not true or give answers that don't match the image content. This inconsistency makes it hard for these models to be reliable.

What's the solution?

To tackle this issue, the authors analyze each aspect of preference alignment for MLLMs separately. They categorize alignment algorithms into two groups: offline methods (such as Direct Preference Optimization, or DPO) and online methods (such as online-DPO), and find that combining the two can improve performance in certain scenarios. They also introduce a new method called Bias-Driven Hallucination Sampling (BDHS), which creates multimodal preference data without needing extra annotations or external models, and show that it performs competitively with previously published alignment approaches. Together, these techniques help MLLMs keep their outputs grounded in the information they receive from images.
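
The core offline objective discussed here, DPO, is compact enough to show in a few lines. The sketch below is a generic DPO loss in PyTorch, not the paper's code; the function name, tensor shapes, and the beta value are illustrative assumptions.

```python
# Minimal sketch of an offline DPO loss, assuming we already have summed
# log-probabilities of each full response under the policy being trained and
# under a frozen reference model. Names and shapes are illustrative only.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on (chosen, rejected) response pairs.

    Each argument is a tensor of shape (batch,) holding the log-probability of a
    complete response. `beta` controls how far the policy may drift from the
    reference model.
    """
    # Log-ratio of policy vs. reference for preferred and dispreferred responses
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Push the margin (chosen minus rejected) to be positive
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```

In the multimodal setting, these log-probabilities would be computed with the image included in the input, so the preference signal pushes the model toward responses that stay consistent with the image.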

Why it matters?

This research is important because it enhances the reliability of MLLMs, making them better at understanding and responding to visual information. By improving alignment techniques, we can create AI systems that are more accurate and useful in real-world applications, such as in education, healthcare, and entertainment.

Abstract

Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by producing responses that are inconsistent with the image content. A primary objective of alignment for MLLMs is to encourage these models to align responses more closely with image information. Recently, multiple works have introduced preference datasets for MLLMs and examined different alignment methods, including Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). However, due to variations in datasets, base model types, and alignment methods, it remains unclear which specific elements contribute most significantly to the reported improvements in these works. In this paper, we independently analyze each aspect of preference alignment in MLLMs. We start by categorizing the alignment algorithms into two groups, offline (such as DPO), and online (such as online-DPO), and show that combining offline and online methods can improve the performance of the model in certain scenarios. We review a variety of published multimodal preference datasets and discuss how the details of their construction impact model performance. Based on these insights, we introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS) that needs neither additional annotation nor external models, and show that it can achieve competitive performance to previously published alignment work for multimodal models across a range of benchmarks.
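
To make the offline/online distinction from the abstract concrete, here is a deliberately tiny, self-contained toy (an illustration, not the paper's setup): at each step two responses are sampled from the current "policy", ranked by a stand-in preference signal that favors "grounded" tokens, and used for a DPO-style update. The vocabulary, the single-vector policy, and the scoring rule are all placeholder assumptions.

```python
# Toy illustration of an online preference loop: sample from the current policy,
# rank the samples, apply a DPO-style update. Everything here is a placeholder.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len, beta = 16, 5, 0.1
grounded_tokens = {0, 1, 2, 3}                # stand-in for "image-consistent" tokens

policy_logits = torch.zeros(vocab_size, requires_grad=True)  # trivial "policy"
ref_logits = torch.zeros(vocab_size)                          # frozen reference
opt = torch.optim.SGD([policy_logits], lr=0.1)

def sample_response():
    probs = torch.softmax(policy_logits, dim=-1)
    return torch.multinomial(probs, seq_len, replacement=True)

def seq_logp(logits, tokens):
    # Sum of per-token log-probabilities of a response
    return torch.log_softmax(logits, dim=-1)[tokens].sum()

def preference_score(tokens):
    # Stand-in judge: prefer responses that use more "grounded" tokens
    return sum(int(t.item() in grounded_tokens) for t in tokens)

for step in range(100):
    a, b = sample_response(), sample_response()            # online: fresh samples
    chosen, rejected = (a, b) if preference_score(a) >= preference_score(b) else (b, a)
    margin = beta * ((seq_logp(policy_logits, chosen) - seq_logp(ref_logits, chosen))
                     - (seq_logp(policy_logits, rejected) - seq_logp(ref_logits, rejected)))
    loss = -F.logsigmoid(margin)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probability mass on "grounded" tokens grows as training proceeds
print(torch.softmax(policy_logits, -1)[:4].sum().item())
```

An offline method would instead iterate over a fixed, pre-collected dataset of (chosen, rejected) pairs; the update rule stays the same, only the source of the preference pairs changes.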