When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs

Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, Lijie Hu

2025-11-05

Summary

This paper investigates how large language models that can process both text and images, called multimodal large language models or MLLMs, decide which type of information to trust when the text and image seem to disagree.

What's the problem?

Currently, researchers have only looked at how well these models handle conflicting information in aggregate, across large datasets. This doesn't tell us *why* a model chooses to believe one source over another in a specific situation. The problem is understanding the factors that influence a model's decision: is it because the model is more confident in its reasoning from one source, or does it simply prefer one type of information over the other?

What's the solution?

The researchers created a new way to analyze this 'modality following' by breaking it down into two key parts: how certain the model is when reasoning with just text or just images (relative reasoning uncertainty), and whether the model has a built-in preference for text or images when both are equally difficult to process (inherent modality preference). They also built a special dataset where they could control how easy or hard it was for the model to understand the text and image separately. By measuring the model’s uncertainty using a concept called entropy, they found a consistent pattern: the more uncertain the model is about one source of information, the less likely it is to trust that source. They also identified a 'balance point' where the model is equally likely to follow either modality, revealing its underlying preference.
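The entropy-based uncertainty measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the answer distributions below are hypothetical, standing in for a model's output probabilities when reasoning from text alone versus the image alone on the same question.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over answer options.
    Higher entropy = the model is less certain about its answer."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical answer distributions from two unimodal runs of the same question.
p_text  = [0.85, 0.10, 0.05]   # text-only reasoning: fairly confident
p_image = [0.40, 0.35, 0.25]   # image-only reasoning: much less certain

h_text, h_image = entropy(p_text), entropy(p_image)

# Relative reasoning uncertainty: positive => text is the more certain modality,
# so (per the paper's finding) the model is more likely to follow the text.
uncertainty_gap = h_image - h_text
print(f"H(text)={h_text:.3f}, H(image)={h_image:.3f}, gap={uncertainty_gap:.3f}")
```

The consistent pattern the paper reports is that as this gap moves in one modality's favor, the probability of following that modality rises monotonically.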

Why it matters?

This research is important because it provides a more detailed and accurate understanding of how MLLMs make decisions when faced with conflicting information. Instead of just looking at overall performance, it identifies the specific factors driving these decisions, allowing researchers to build more reliable and trustworthy multimodal AI systems. Understanding these internal mechanisms also helps explain why models sometimes seem indecisive, showing they actually 'oscillate' between different possibilities before settling on an answer.

Abstract

Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of the model's confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (a model's stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. We identify the relative difficulty level at which the model tends to follow both modalities with comparable probability, which we call the balance point, as a practical indicator of the model's inherent preference. Unlike traditional macro-level ratios, this measure offers a more principled and less confounded way to characterize modality bias, disentangling it from unimodal capabilities and dataset artifacts. Further, by probing layer-wise predictions, we reveal the internal mechanism of oscillation: in ambiguous regions near the balance point, models vacillate between modalities across layers, explaining externally observed indecision. Together, these findings establish relative uncertainty and inherent preference as the two governing principles of modality following, offering both a quantitative framework and mechanistic insight into how MLLMs resolve conflicting information.
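The balance point the abstract describes can be estimated numerically: sweep the relative difficulty, record how often the model follows the text modality at each level, and find where that rate crosses 50%. The sketch below uses made-up follow rates purely for illustration; the sweep values and the linear interpolation are assumptions, not the paper's procedure.

```python
# Hypothetical sweep: fraction of conflict cases where the model follows the text
# modality, recorded at several relative-difficulty levels. Positive levels mean
# the image-side reasoning is more uncertain than the text-side reasoning.
levels      = [-2.0, -1.0, 0.0, 1.0, 2.0]
follow_text = [0.15, 0.35, 0.55, 0.75, 0.90]

def balance_point(xs, ys, target=0.5):
    """Linearly interpolate the x at which the follow rate crosses `target`."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if (y0 - target) * (y1 - target) <= 0:  # sign change => crossing here
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    return None  # no crossing in the sweep range

bp = balance_point(levels, follow_text)
print(f"balance point at relative difficulty {bp:.2f}")
```

In this toy data the crossing lands at a negative level: the model still follows text even when its text-side reasoning is somewhat *more* uncertain, which is exactly the kind of inherent text preference the balance point is meant to expose, disentangled from raw unimodal ability.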