MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity
Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang
2025-11-06
Summary
This paper introduces a new way to test how well AI models that can 'see' and 'think' (called Multimodal Large Language Models or MLLMs) actually understand visual information, going beyond just how well they handle text.
What's the problem?
Current tests for these AI models often focus too much on language skills or don't thoroughly check how well they can reason about what they *see*. They don't really dig into the specific ways vision impacts thinking, leaving a gap in understanding their true cognitive abilities. It's like testing a student on a history essay when you really want to know if they can interpret a historical photograph.
What's the solution?
The researchers created a benchmark called MME-CC, which stands for Multi-Modal Evaluation benchmark of Cognitive Capacity. This benchmark groups 11 different tasks into three categories of visual reasoning: spatial relationships, geometry, and knowledge grounded in images. They then tested 16 different AI models on this benchmark, analyzing in detail where the models succeed and where they fail. They also examined *how* the models arrive at their answers, noticing a recurring pattern of first extracting information from the image, then reasoning over it, and finally checking their work (a rough sketch of this kind of per-category scoring appears below).
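To make the category-wise evaluation concrete, here is a minimal Python sketch of how per-category accuracy could be aggregated for a benchmark organized this way. The task names, category mapping, and result format are illustrative assumptions for this example, not the paper's actual task list or evaluation code.

```python
from collections import defaultdict

# Hypothetical mapping from benchmark tasks to the three reasoning categories.
# The real MME-CC task names may differ; these are placeholders for illustration.
TASK_TO_CATEGORY = {
    "maze_navigation": "spatial",
    "viewpoint_matching": "spatial",
    "shape_composition": "geometric",
    "angle_estimation": "geometric",
    "chart_knowledge": "knowledge",
    "diagram_qa": "knowledge",
}


def score_by_category(results):
    """Aggregate per-question correctness into per-category accuracy.

    `results` is a list of (task_name, is_correct) pairs, e.g. produced by
    running one model over the benchmark and grading each answer.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, is_correct in results:
        category = TASK_TO_CATEGORY[task]
        total[category] += 1
        correct[category] += int(is_correct)
    return {cat: correct[cat] / total[cat] for cat in total}


if __name__ == "__main__":
    # Toy example: a few graded answers from a single model.
    toy_results = [
        ("maze_navigation", False),
        ("viewpoint_matching", True),
        ("shape_composition", False),
        ("angle_estimation", False),
        ("chart_knowledge", True),
        ("diagram_qa", True),
    ]
    for category, acc in score_by_category(toy_results).items():
        print(f"{category:10s} accuracy: {acc:.2%}")
```

Reporting scores per category rather than as a single number is what lets the authors observe, for example, that spatial and geometric reasoning lag behind knowledge-based reasoning.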
Why it matters?
This work is important because it highlights that simply making AI models bigger isn't enough. We need to specifically evaluate and improve their ability to understand and reason about visual information, just like humans do. By focusing on 'cognitive capacity' – how well they actually *think* with images – we can build better and more reliable AI systems.
Abstract
As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (≤ 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract → reason → verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.