BabyVision: Visual Reasoning Beyond Language
Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu
2026-01-13
Summary
This research shows that although today's powerful AI models, Multimodal Large Language Models (MLLMs), excel at knowledge-heavy tasks, they surprisingly struggle with basic visual understanding that even young children possess.
What's the problem?
Current MLLMs rely too much on language to understand images. They don't actually 'see' and interpret visuals the way humans do. This means they fail at simple visual tasks that a three-year-old could easily handle, showing a gap in their fundamental visual skills. The researchers wanted a way to specifically test these core visual abilities without language getting in the way.
What's the solution?
To address this, the researchers created a new benchmark called BabyVision. It contains 388 visual challenges, grouped into 22 subclasses across four key categories, designed to test basic visual skills such as recognizing shapes, understanding object relationships, and tracking movement. They then evaluated leading MLLMs, including Gemini3-Pro-Preview, on BabyVision and compared their performance against humans of different ages. They also explored having generation models *produce* answers to visual questions (BabyVision-Gen) and built a tool to automatically check those answers.
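To make the evaluation setup concrete, here is a minimal sketch of how a benchmark like this might be scored: each item carries a category label and a ground-truth answer, a model's answer is checked automatically, and accuracy is aggregated per category for comparison against human baselines. All names here (`BenchmarkItem`, `score_model`) are illustrative assumptions, not the actual BabyVision API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    category: str  # e.g. "shape recognition", "object relations"
    question: str
    answer: str    # ground-truth answer


def score_model(items, model_answer):
    """Return per-category accuracy for a model's answers.

    `model_answer` is any callable mapping an item to an answer string;
    answers are compared case-insensitively after stripping whitespace.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item.category] += 1
        if model_answer(item).strip().lower() == item.answer.strip().lower():
            correct[item.category] += 1
    return {c: correct[c] / total[c] for c in total}


# Toy example with a trivial "model" that always answers "circle".
items = [
    BenchmarkItem("shape recognition", "Which shape is round?", "circle"),
    BenchmarkItem("shape recognition", "Which shape has 3 sides?", "triangle"),
]
print(score_model(items, lambda item: "circle"))  # {'shape recognition': 0.5}
```

Per-category aggregation matters here because, as the paper reports, aggregate scores can hide which visual primitives (shapes, relations, motion) a model actually fails on.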
Why it matters?
This work is important because it shows that even the most advanced AI still lacks fundamental visual perception abilities. Improving these abilities is crucial for creating AI that can truly understand the world around it and reason like humans do. The BabyVision benchmark provides a valuable tool for researchers to track progress in this area and build more human-like AI.
Abstract
While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncover a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess MLLMs' core visual abilities independent of linguistic knowledge. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress on BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproduction.