
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu

2025-04-24


Summary

This paper talks about VisuLogic, a new benchmark designed to test how well large language models that understand both pictures and text can actually reason and solve problems based on what they see.

What's the problem?

The problem is that while these multimodal models are getting better at tasks like describing images or answering questions about them, it's not clear if they can really think through visual problems the way humans do. There are concerns that these models might just be matching patterns instead of truly understanding or reasoning about what they see.

What's the solution?

The researchers created VisuLogic, a benchmark of challenging tasks that require genuine visual reasoning, not just basic recognition or description. They then evaluated a range of multimodal models on VisuLogic and compared their scores with how well humans do on the same tasks.
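To make this kind of evaluation concrete, here is a minimal sketch of how a multiple-choice visual reasoning benchmark is typically scored. It is an illustrative assumption, not the paper's actual harness: the item fields, the toy data, and the stub `query_model` function are all hypothetical stand-ins, and it assumes a four-option multiple-choice format where a model must beat the 25% random-guessing baseline to show any real reasoning.

```python
import random

# Hypothetical sketch of a benchmark evaluation loop (not VisuLogic's real code).
# Each item pairs an image with a multiple-choice question, and a model is
# scored by exact-match accuracy against the answer key.

benchmark = [
    {"image": "item_001.png", "question": "Which figure continues the pattern?",
     "choices": ["A", "B", "C", "D"], "answer": "C"},
    {"image": "item_002.png", "question": "Which shape completes the grid?",
     "choices": ["A", "B", "C", "D"], "answer": "A"},
]

def query_model(image_path: str, question: str, choices: list[str]) -> str:
    """Stand-in for a real multimodal model call. This stub guesses at random,
    which is exactly the 1/len(choices) baseline a benchmark must beat."""
    return random.choice(choices)

# Score the model: fraction of items where its choice matches the key.
correct = sum(
    query_model(item["image"], item["question"], item["choices"]) == item["answer"]
    for item in benchmark
)
accuracy = correct / len(benchmark)
print(f"model accuracy: {accuracy:.1%} (random baseline: 25.0%)")
```

Running the same loop with human answers in place of `query_model` gives the human baseline, and the gap between the two numbers is what the benchmark is designed to expose.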

Why it matters?

This is important because it shows that even the best current models still have a long way to go before they can match human-level thinking when it comes to understanding and reasoning about images. Knowing where these gaps are helps researchers focus on making future models smarter and more reliable.

Abstract

VisuLogic is a benchmark to evaluate genuine vision-centric reasoning in multimodal large language models, revealing significant performance gaps compared to human accuracy.