ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li

2026-04-02

Summary

This paper points out that even though AI models that generate images and videos look really good, they often struggle with tasks that require understanding how the physical world works, like cause and effect or spatial relationships. The authors argue that current tests don't really show how flawed these models are.

What's the problem?

The main problem is that we're getting tricked into thinking AI vision models are smarter than they actually are. Existing tests focus on whether the *final* image or video looks good, not on *how* the AI arrived at that result. This means the AI can 'cheat' and still score well without actually understanding the underlying concepts of physics or logic. It's like getting a good grade on a test by memorizing answers instead of understanding the material.

What's the solution?

To fix this, the researchers created a new testing framework called ViGoR. This framework is different because it tests both the steps the AI takes *during* the generation process and the final result. It also uses an automated system that checks the AI's reasoning against what a human would expect, and breaks down performance into specific areas of reasoning to pinpoint exactly where the AI is failing. It works with both images and videos to provide a comprehensive evaluation.
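To make the dual-track idea concrete, here is a minimal sketch of how process and result scores might be combined per reasoning dimension. All names, dimensions, and weights below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of ViGoR-style dual-track scoring.
# Dimension names and the 50/50 weighting are assumptions for illustration.

REASONING_DIMENSIONS = ["physical", "causal", "spatial"]

def dual_track_score(process_scores, result_scores, process_weight=0.5):
    """Blend per-dimension scores for the generation *process* and the
    *final output*, returning a diagnostic report plus an overall score."""
    report = {}
    for dim in REASONING_DIMENSIONS:
        p = process_scores.get(dim, 0.0)  # judged on intermediate steps
        r = result_scores.get(dim, 0.0)   # judged on the final image/video
        report[dim] = process_weight * p + (1 - process_weight) * r
    overall = sum(report.values()) / len(report)
    return report, overall

# Example: a model that generates plausible frames but weak causal chains.
report, overall = dual_track_score(
    {"physical": 0.8, "causal": 0.4, "spatial": 0.6},
    {"physical": 0.9, "causal": 0.5, "spatial": 0.7},
)
```

The point of the decomposition is that a single aggregate number would hide exactly the kind of reasoning deficit the benchmark is designed to expose; the per-dimension report shows *where* a model fails, not just that it fails.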

Why it matters?

This work is important because it provides a much more accurate way to measure the intelligence of AI vision models. By exposing the weaknesses in these models, it helps researchers focus on improving their reasoning abilities, which is crucial for building truly intelligent systems that can interact with the real world safely and effectively. It's a 'stress test' that will push the next generation of AI to be more than just visually appealing.

Abstract

Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/