
Can Vision-Language Models Solve the Shell Game?

Tiedong Liu, Wee Sun Lee

2026-03-16


Summary

This paper investigates how well Vision-Language Models (VLMs) – AI systems that understand both images and text – can track specific objects in videos. It finds that these models are surprisingly bad at it, especially when the objects are visually identical so that appearance alone cannot tell them apart, and proposes a new method to improve their tracking abilities.

What's the problem?

Current VLMs struggle to truly *track* objects over time: they tend to recognize objects in individual frames rather than understand that it's the *same* object moving around. To expose this weakness, the researchers built a diagnostic benchmark, VET-Bench, whose objects are visually identical, so per-frame appearance is useless and only spatiotemporal continuity can identify each object. On this test, state-of-the-art VLMs perform at roughly chance level. The paper traces this to an expressivity limit of the underlying transformer architecture: a fixed-depth transformer cannot, without intermediate supervision, maintain and update the running state needed to track indistinguishable objects through a video.
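The core difficulty can be seen in a toy shell game. When the cups are identical, no single frame reveals where the ball is; the only way to answer is to compose every swap in order, i.e., to maintain state across time. A minimal sketch (an illustration of the state-tracking problem, not code from the paper):

```python
# Toy shell game: three identical cups, one hides a ball.
# Per-frame appearance carries no information, so the only way
# to find the ball is to apply every swap sequentially --
# exactly the kind of running state a single fixed-depth
# forward pass struggles to maintain.

def track_ball(start: int, swaps: list[tuple[int, int]]) -> int:
    """Return the cup index holding the ball after all swaps."""
    pos = start
    for a, b in swaps:  # each swap is one "event" in the video
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
    return pos

# Ball starts under cup 0; swaps (0,1), (1,2), (0,2) bring it back to cup 0.
print(track_ball(0, [(0, 1), (1, 2), (0, 2)]))  # → 0
```

Note that skipping or reordering even one swap changes the answer, which is why shortcut strategies based on individual frames score at chance.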

What's the solution?

The researchers developed a technique called Spatiotemporal Grounded Chain-of-Thought (SGCoT). This method has the VLM explicitly predict the path, or trajectory, of an object as an intermediate reasoning step. Think of it like the model saying, 'First the object was here, then it moved there, and now it's here.' They elicited this behavior by fine-tuning on synthesized text-only data, leveraging Molmo2's object-tracking ability to produce accurate trajectories, and this significantly improved performance on VET-Bench, reaching over 90% accuracy.
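The intermediate-trajectory idea can be sketched as follows. The model first emits the target's position in every frame as grounded intermediate state, then reads the answer off the final entry. The data format and field names below are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical sketch of an SGCoT-style trace: per-frame boxes for
# the target object serve as explicit intermediate state, and the
# final answer is derived from the last step rather than predicted
# directly from raw frames.

frames = [
    {"t": 0, "box": (120, 80, 40, 40)},  # target cup at frame 0
    {"t": 1, "box": (200, 82, 40, 40)},  # moved right after a swap
    {"t": 2, "box": (120, 81, 40, 40)},  # swapped back
]

def sgcot_trace(frames: list[dict]) -> str:
    """Render the trajectory as a chain-of-thought string,
    ending with the object's final position."""
    steps = [f"frame {f['t']}: target at {f['box']}" for f in frames]
    answer = f"final position: {frames[-1]['box']}"
    return "\n".join(steps + [answer])

print(sgcot_trace(frames))
```

Making each step explicit turns one hard end-to-end prediction into a sequence of easy per-frame predictions, which is the kind of intermediate supervision the paper's theoretical analysis argues fixed-depth transformers need.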

Why it matters?

This research is important because reliable object tracking is crucial for VLMs to truly understand videos. If a model can’t follow an object, it can’t answer questions about what’s happening in the video or perform tasks that require understanding object interactions. By identifying this limitation and proposing a solution, the researchers are helping to build more intelligent and capable AI systems that can better interpret the visual world.

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .