MLLMs are Deeply Affected by Modality Bias
Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, Linfeng Zhang, Danda Pani Paudel, Xuanjing Huang, Yu-Gang Jiang, Nicu Sebe, Dacheng Tao, Luc Van Gool, Xuming Hu
2025-05-28
Summary
This paper examines how Multimodal Large Language Models (MLLMs), which are designed to understand both text and images, in practice pay far more attention to language and tend to underuse other kinds of information, such as pictures.
What's the problem?
These models are built to combine several types of information, but because they lean so heavily on language, they underuse visual and other non-text cues. As a result, they are less effective at genuinely understanding situations that involve more than just words.
What's the solution?
To highlight this issue, the researchers analyzed how these models behave and demonstrated a strong bias toward language. They argue that new model architectures and training methods are needed so that these AIs weigh all types of input appropriately. A sketch of one common way to probe for this bias follows below.
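The paper itself does not ship code, but a minimal sketch of a standard blank-image probe can make the idea concrete: ask the same question with the real image and with an uninformative one, and see whether the answer changes. This assumes a LLaVA-style checkpoint served through Hugging Face transformers; the checkpoint name, file path, and the `answer` helper are illustrative, not the authors' method.

```python
# Minimal sketch of a blank-image probe for modality bias (illustrative,
# not the paper's experimental protocol).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; any MLLM with
processor = AutoProcessor.from_pretrained(model_id)  # an image+text API works
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def answer(image: Image.Image, question: str) -> str:
    """Generate an answer for an (image, question) pair."""
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    return processor.decode(out[0], skip_special_tokens=True)

question = "What color is the traffic light?"
real = Image.open("scene.jpg")  # hypothetical image that actually answers the question
blank = Image.new("RGB", real.size, (128, 128, 128))  # uninformative gray image

print("with image: ", answer(real, question))
print("blank image:", answer(blank, question))
```

A strongly language-biased model tends to give the same plausible-sounding answer in both cases, relying on language priors instead of what is (or is not) in the picture, which is exactly the failure mode the paper highlights.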
Why it matters?
This matters because if AI can learn to balance language with images and other inputs, it will be much better at understanding the real world, making it more useful for things like education, robotics, and digital assistants.
Abstract
MLLMs exhibit modality bias, favoring language over other modalities such as vision, which impedes balanced multimodal integration and motivates research into balancing strategies and architectures.