More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, Sheng Liu

2025-06-02

Summary

This paper introduces a new way to test whether AI models that work with both images and text stay focused on what they actually see instead of making things up, especially when they have to think through complicated problems.

What's the problem?

The problem is that when these multimodal AI models reason through tough questions, the longer they "think," the less attention they pay to the image: they start to "hallucinate," making up details that aren't really there, which leads to wrong or misleading answers.

What's the solution?

The researchers built a new benchmark and a scoring metric that measure how well these models keep their answers grounded in the actual visual information as they reason. They found that larger models and certain types of training data help the AI balance thinking deeply with sticking to what it really sees.
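
To make the idea of a "scoring metric" concrete, here is a minimal sketch of how an area-under-curve style grounding score could be computed from (reasoning length, accuracy) measurements. The function name rh_auc, the length normalization, the trapezoidal integration, and the example numbers are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def rh_auc(reasoning_lengths, perception_accuracies):
    """Illustrative grounding score: area under the curve of visual
    accuracy as a function of reasoning length. A score near 1.0 means
    the model stays grounded in the image no matter how long it
    reasons; lower scores mean hallucination grows with thinking.
    """
    lengths = np.asarray(reasoning_lengths, dtype=float)
    accs = np.asarray(perception_accuracies, dtype=float)

    # Sort the points by reasoning length so the curve is well defined.
    order = np.argsort(lengths)
    x, y = lengths[order], accs[order]

    # Normalize reasoning length to [0, 1] so models with different
    # reasoning budgets can be compared on the same scale.
    x = (x - x[0]) / (x[-1] - x[0])

    # Trapezoidal area under the accuracy curve.
    return np.trapz(y, x)

# Hypothetical example: accuracy drops as the model "thinks" longer,
# reflecting the "more thinking, less seeing" trade-off.
lengths = [100, 300, 600, 1200]      # reasoning tokens per answer
accuracy = [0.82, 0.75, 0.64, 0.51]  # fraction of grounded answers
print(f"grounding score: {rh_auc(lengths, accuracy):.3f}")
```

A model that stays accurate across all reasoning lengths would score near 1.0, while one whose accuracy collapses as its reasoning grows would score much lower, so a single number captures the trade-off between thinking and seeing.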

Why it matters?

This is important because it helps make AI more trustworthy and accurate when dealing with tasks that mix pictures and language, which is useful for things like education, science, and any situation where you need reliable answers based on both images and text.

Abstract

A new metric and benchmark are introduced to evaluate multimodal large language models' ability to maintain visual grounding while performing extended reasoning, revealing that larger models and specific training data types improve this balance.