Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Junnan Dong, Shu-Yu Guo, Yangning Li, Qingyu Zhou, Wenhao Jiang, Hai-Tao Zheng, Ying Shen, Liang Lin, Philip S. Yu
2026-03-20
Summary
This paper investigates how well advanced AI models, specifically those that can understand both images and text, handle discrete symbols such as mathematical equations and chemical formulas, which are essential building blocks of complex reasoning.
What's the problem?
Current AI models are very good at understanding general images, but they struggle to accurately 'read' and interpret specialized symbols. Unlike ordinary pictures, symbols demand precise interpretation. The paper points out that models sometimes arrive at the right answer to a complex problem *without* actually recognizing the symbols involved, a disconnect the authors call a 'cognitive mismatch'.
What's the solution?
The researchers created a benchmark, a purpose-built set of tests spanning five domains (language, culture, mathematics, physics, and chemistry), designed to measure how well these AI models handle symbols. They then evaluated several top-tier models to see where they succeed and, more importantly, where they fail when dealing with these symbolic systems. A sketch of how such an evaluation might be structured follows.
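The paper itself does not include code here, but a minimal sketch of the paired recognition-versus-reasoning check described above could look like the following. Everything in it is an assumption for illustration: `SymbolItem`, its fields, and the `ask` callable are hypothetical stand-ins for the benchmark's actual data format and for whatever MLLM API is under test.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class SymbolItem:
    # Hypothetical benchmark record; field names are illustrative only.
    domain: str            # e.g. "math", "chemistry"
    image_path: str        # rendered symbol, formula, or structure
    recognition_gold: str  # exact transcription of the symbol
    reasoning_q: str       # downstream question that uses the symbol
    reasoning_gold: str    # expected answer to that question

def mismatch_rate(items: Iterable[SymbolItem],
                  ask: Callable[[str, str], str]) -> float:
    """Fraction of correctly solved items where symbol recognition failed.

    `ask(image_path, prompt)` is a placeholder for any MLLM call that
    takes an image and a text prompt and returns a text answer.
    """
    mismatches = solved = 0
    for it in items:
        recognized = (
            ask(it.image_path, "Transcribe the symbol exactly.").strip()
            == it.recognition_gold
        )
        answered = ask(it.image_path, it.reasoning_q).strip() == it.reasoning_gold
        if answered:
            solved += 1
            if not recognized:
                # Solved the problem without reading the symbol:
                # the "cognitive mismatch" case.
                mismatches += 1
    return mismatches / solved if solved else 0.0
```

Exact string matching is a deliberate simplification; a real harness would normalize answers (for example, checking LaTeX equivalence for formulas) before scoring.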
Why does it matter?
This research matters because it exposes a key weakness in current AI models: they don't truly 'understand' the symbolic languages used in science and abstract thought. They may be good at predicting what comes next based on linguistic patterns, but they don't actually 'see' and interpret the symbols themselves. Fixing this is essential for building AI that can genuinely assist with scientific discovery and complex problem-solving.
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.