
Evaluating and Steering Modality Preferences in Multimodal Large Language Model

Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang

2025-06-02


Summary

This paper examines how multimodal large language models, AI systems that can handle inputs like text, images, and audio, often prefer one type of input over the others, and shows that this preference can be measured and steered so the models perform better on different tasks.

What's the problem?

The problem is that these models tend to rely too heavily on one type of information, usually text, while underusing others, such as images or audio. This is called modality bias, and it means the models are not making use of all the information available to them, which can make them less accurate or flexible, especially when the preferred input is missing or unhelpful.

What's the solution?

The researchers show that a method called representation engineering can adjust how strongly the model relies on each type of input. By steering the model's internal representations, they reduce its over-reliance on one modality and get it to use all of its inputs more evenly. This helps the model perform better on tasks like hallucination mitigation (avoiding made-up answers) and multimodal machine translation, as sketched in the example below.
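
To make the idea concrete, here is a minimal sketch of this style of representation steering, assuming a PyTorch model whose decoder layers can be hooked. The layer index, the scaling factor alpha, and the simple difference-of-means direction are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of representation-engineering style modality steering.
# Assumes activations were already collected on text-dominant vs.
# image-dominant examples; all names and hyperparameters are illustrative.
import torch

def modality_direction(text_biased_acts, image_biased_acts):
    """Difference-of-means direction between hidden states gathered on
    text-dominant vs. image-dominant inputs (each tensor: [n, hidden_dim])."""
    direction = image_biased_acts.mean(dim=0) - text_biased_acts.mean(dim=0)
    return direction / direction.norm()

def add_steering_hook(layer, direction, alpha=4.0):
    """Register a forward hook that shifts the layer's output along the
    modality direction; a positive alpha pushes toward the visual modality."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Hypothetical usage with a loaded MLLM (layer index 20 is an assumption):
# direction = modality_direction(text_acts, image_acts)
# handle = add_steering_hook(model.language_model.model.layers[20], direction)
# ... run generation with the hook attached, then: handle.remove()
```

The key design point is that nothing in the model is retrained: a single direction in activation space is added at inference time, and its sign and scale control how far the model is pushed toward or away from a given modality.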

Why it matters?

This is important because it makes AI systems more balanced and reliable, allowing them to handle a wider range of situations and tasks. By fixing modality bias, these models can provide better results in real-world applications where information comes from many sources, like combining text, images, and sounds.

Abstract

MLLMs exhibit modality bias in multimodal processing, which can be controlled using a representation engineering method to improve tasks like hallucination mitigation and multimodal machine translation.