Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge

2026-03-18

Summary

This paper focuses on improving how well AI models that understand both images and text can answer questions, specifically by reducing instances where the model 'hallucinates' or makes up information.

What's the problem?

Current AI models, called multimodal large reasoning models (MLRMs), sometimes get confused when answering questions, especially around transition words that signal a shift in thought, like 'because' or 'however'. At these points the model is often uncertain and starts generating incorrect answers. The researchers believe this happens because the model commits to one specific word at a time, instead of using the broader context it already understands from the image and text.

What's the solution?

The researchers developed a new technique called Latent Entropy-Aware Decoding, or LEAD. This method helps the model use a more nuanced representation of its own uncertainty. When the model is uncertain, meaning its probability distribution over the next word has high 'entropy', LEAD encourages it to consider multiple candidate meanings at once by feeding in a continuous, probability-weighted blend of embeddings instead of committing to a single word. Once the uncertainty drops, the model switches back to ordinary word-by-word decoding. LEAD also injects visual anchors that help the model pay closer attention to the image parts of the question. Essentially, LEAD makes the model think more flexibly and weigh all the available clues before giving an answer.
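The entropy-aware switch described above can be illustrated with a minimal sketch. This is not the paper's implementation: the threshold `tau`, the function names, and the use of a plain embedding table are all illustrative assumptions; the core idea shown is that a high-entropy next-token distribution yields a probability-weighted mix of candidate embeddings, while a low-entropy one yields the usual discrete (argmax) token embedding.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -np.sum(p * np.log(p + 1e-12))

def next_input_embedding(logits, embedding_table, tau=1.0):
    """Entropy-aware reasoning-mode switch (illustrative sketch only).

    High entropy  -> 'superposed' continuous input: all candidate token
                     embeddings mixed, weighted by their probabilities.
    Low entropy   -> standard discrete input: the argmax token's embedding.
    `tau` is a hypothetical entropy threshold, not a value from the paper.
    """
    p = softmax(logits)
    if entropy(p) > tau:
        # Continuous mode: (vocab,) @ (vocab, dim) -> (dim,) blended embedding.
        return p @ embedding_table
    # Discrete mode: fall back to the single most likely token's embedding.
    return embedding_table[int(np.argmax(p))]
```

With a peaked distribution the function returns one row of the table; with a near-uniform (high-entropy) distribution it returns an average of all rows, which is the "superposed" latent input the summary describes.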

Why it matters?

This research is important because it makes AI models more reliable and trustworthy. By reducing hallucinations, these models can provide more accurate answers to questions about images and text, which is crucial for applications like image captioning, visual question answering, and potentially even more complex tasks like medical diagnosis or scientific research.

Abstract

Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.