Pixels, Patterns, but No Poetry: To See The World like Humans

Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang

2025-07-24

Summary

This paper introduces the Turing Eye Test, a new benchmark for measuring how well multimodal large language models (MLLMs) understand images: the models are tested on synthetic pictures that humans can read easily, and their perception is compared against human performance.

What's the problem?

While AI models can recognize pixels and patterns, they do not see and understand the world the way humans do, especially when it comes to generalizing their knowledge to new visual situations.

What's the solution?

The researchers created the Turing Eye Test to evaluate whether models can perceive and interpret images the way humans do, and found that the main gap lies in the vision tower (the model's visual encoder), which generalizes poorly beyond the kinds of images it was trained on. A minimal sketch of how such a benchmark loop could be scored appears below.
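
To make the evaluation setup concrete, here is a minimal sketch of a perception benchmark loop in the spirit of the Turing Eye Test. Everything in it is illustrative: the SyntheticSample fields, the query_mllm stub, and the example data are hypothetical placeholders, not the authors' actual dataset or code.

```python
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    image_path: str  # path to a synthetic test image (hypothetical example data)
    question: str    # perception question posed about the image
    answer: str      # ground-truth answer a human reader would give


def query_mllm(image_path: str, question: str) -> str:
    """Hypothetical stub for an MLLM call; a real benchmark would swap in
    an API or local vision-language model here."""
    return ""  # placeholder: a real implementation returns the model's answer


def accuracy(samples: list[SyntheticSample]) -> float:
    """Fraction of samples the model answers correctly (exact match)."""
    correct = sum(
        query_mllm(s.image_path, s.question).strip().lower()
        == s.answer.strip().lower()
        for s in samples
    )
    return correct / len(samples)


if __name__ == "__main__":
    # Hypothetical usage: model accuracy far below a near-perfect human
    # baseline on such images would expose the perception gap the
    # benchmark is designed to measure.
    samples = [
        SyntheticSample(
            "hidden_text_01.png",
            "What word is hidden in this image?",
            "poetry",
        ),
    ]
    print(f"model accuracy: {accuracy(samples):.2%}")
```

Comparing this score against human accuracy on the same synthetic images is what lets the benchmark attribute failures to perception rather than to language or reasoning.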

Why it matters?

This matters because understanding where AI perception falls short of human perception is key to improving these models, making them better at image-understanding tasks that are important for applications such as robotics, medicine, and autonomous systems.

Abstract

The Turing Eye Test evaluates MLLMs' perceptual capabilities on synthetic images, revealing that vision tower generalization is a key gap compared to human perception.