Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation

Qiyue Gao, Xinyu Pi, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, Stone Tao, Mengyang Liu, Jiaxi Yang, Chao-Jung Lai, Chuanyang Jin, Jiannan Xiang, Benhao Huang, Zeming Chen, David Danks, Hao Su, Tianmin Shu, Ziqiao Ma

2025-06-30

Summary

This paper talks about a new way to test how well Vision-Language Models (VLMs) understand the world around them by evaluating their ability to perceive scenes and predict what might happen next.

What's the problem?

The problem is that while VLMs can connect images with text, it's unclear how much they actually understand about the objects and events in the scenes or how well they can anticipate what comes next. Existing tests don't measure these core abilities effectively.

What's the solution?

The researchers created a detailed benchmark framework that breaks the evaluation down into small, atomic tasks, each checking a single perception or prediction skill. This fine-grained breakdown reveals where current VLMs fall short of truly understanding and modeling the world internally.
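To make the idea of an "atomic" evaluation concrete, here is a minimal, hypothetical Python sketch of how such tasks and a per-skill score could be represented. The class, field names, and scoring logic are illustrative assumptions, not the paper's actual benchmark code.

```python
from dataclasses import dataclass

# Hypothetical illustration: each atomic task isolates a single world-modeling
# skill (e.g., perceiving an object's state vs. predicting the next state),
# so a failure can be attributed to one ability at a time.

@dataclass
class AtomicTask:
    dimension: str       # skill probed, e.g. "perception" or "prediction" (assumed labels)
    question: str        # question posed to the VLM about a scene
    choices: list[str]   # candidate answers
    answer: str          # ground-truth choice

def score(tasks: list[AtomicTask], model_answers: list[str]) -> dict[str, float]:
    """Report accuracy separately per dimension, so perception and prediction
    strengths and weaknesses are visible individually."""
    totals: dict[str, list[int]] = {}
    for task, pred in zip(tasks, model_answers):
        correct, seen = totals.setdefault(task.dimension, [0, 0])
        totals[task.dimension] = [correct + (pred == task.answer), seen + 1]
    return {dim: c / n for dim, (c, n) in totals.items()}

# Toy usage with two made-up items:
tasks = [
    AtomicTask("perception", "Which object is left of the cup?",
               ["ball", "book"], "ball"),
    AtomicTask("prediction", "If the ball is pushed off the table, where does it end up?",
               ["on the floor", "in the cup"], "on the floor"),
]
print(score(tasks, ["ball", "in the cup"]))  # {'perception': 1.0, 'prediction': 0.0}
```

In this sketch, reporting accuracy per dimension (rather than one overall score) mirrors the paper's goal of pinpointing whether a model's weakness lies in perceiving the scene or in predicting what happens next.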

Why it matters?

This matters because knowing the strengths and weaknesses of VLMs in world modeling helps guide future research to build smarter and more reliable AI that can better perceive and interact with the real world.

Abstract

A benchmark framework evaluates the world modeling capabilities of Vision-Language Models, highlighting their limitations in perception and prediction.