MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
Huanqia Cai, Yijun Yang, Winston Hu
2025-02-04

Summary
This paper introduces MM-IQ, a new benchmark for testing how well AI systems that use both images and text can think and reason like humans. It works much like an IQ test for computers, focusing on the ability to solve problems that require abstract thinking rather than memorized knowledge.
What's the problem?
There is currently no good way to measure how well AI systems that work with both pictures and text can think abstractly or reason like humans. Existing tests often focus on specific knowledge or language skills, but they don't evaluate the kind of reasoning that human IQ tests are designed to measure.
What's the solution?
The researchers created MM-IQ, a benchmark with 2,710 carefully designed test questions covering eight different types of reasoning. These questions are meant to challenge AI systems in the same way human IQ tests challenge people. The researchers then used MM-IQ to evaluate leading open-source and proprietary multimodal models and found that even the most advanced ones performed only slightly better than random guessing (27.49% accuracy versus a 25% chance baseline).
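To make the "barely above chance" result concrete, here is a minimal sketch of how accuracy on a four-option multiple-choice benchmark compares to the 25% random baseline. The item format and the `model_predict` function are illustrative assumptions, not the authors' actual evaluation harness.

import random

def model_predict(item):
    # Placeholder for a real multimodal model call; here it simply guesses,
    # which is what the 25% chance baseline corresponds to.
    return random.choice(item["options"])

def evaluate(items):
    # Fraction of items where the predicted option matches the labeled answer.
    correct = sum(model_predict(item) == item["answer"] for item in items)
    return correct / len(items)

if __name__ == "__main__":
    # Synthetic stand-in for the 2,710 MM-IQ items, each with four options,
    # so random guessing lands near the 25% baseline reported in the paper.
    items = [{"options": ["A", "B", "C", "D"], "answer": "A"} for _ in range(2710)]
    print(f"Accuracy: {evaluate(items):.2%} (chance baseline: 25.00%)")

Under this setup, the reported 27.49% for state-of-the-art models is only about 2.5 percentage points above what pure guessing achieves.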
Why it matters?
This research matters because it shows how far AI still has to go to match human reasoning abilities. The results reveal a large gap between current multimodal AI systems and human-level abstract thinking, pointing to the need for major advances in how these models are designed and trained. Closing that gap could lead to substantially more capable AI in the future.
Abstract
IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.