MMGR: Multi-Modal Generative Reasoning

Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu

2025-12-17

Summary

This paper investigates how well new video- and image-generating AI models actually *understand* the world, rather than just producing content that *looks* realistic. It finds that while these models can make videos and images appear believable, they often fail at tasks requiring basic reasoning about physics, logic, and spatial relationships.

What's the problem?

Current metrics for judging these AI models, such as Frechet Video Distance, focus on how visually convincing the output is. As a result, a model can score well even when its outputs are physically impossible or logically inconsistent. The core issue is that existing metrics don't test whether the AI truly understands how the world works, only whether it can *mimic* what the world looks like.

What's the solution?

The researchers created a new evaluation system called MMGR, which stands for Multi-Modal Generative Reasoning. MMGR tests models on five types of reasoning: physical, logical, 3D spatial, 2D spatial, and temporal. They evaluated leading video and image models on tasks like solving abstract puzzles, navigating real-world 3D environments, and predicting how objects will interact physically. They then applied fine-grained metrics that only credit outputs that are correct as a whole, so models can't pass on visual similarity alone.
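The difference between rewarding visual similarity and requiring holistic correctness can be illustrated with a minimal sketch. This is my own toy illustration, not code from the paper: the function names and the tiny ARC-AGI-style grid are hypothetical, and the grids stand in for generated vs. target outputs.

```python
# Toy illustration of per-cell similarity vs. holistic correctness.
# A generated ARC-AGI-style grid that is "almost right" still scores
# well under a similarity-style metric, but holistic scoring gives it
# zero credit unless the entire grid is exact.

def pixel_accuracy(pred, target):
    """Fraction of matching cells -- rewards near-misses."""
    matches = [p == t
               for row_p, row_t in zip(pred, target)
               for p, t in zip(row_p, row_t)]
    return sum(matches) / len(matches)

def holistic_accuracy(pred, target):
    """1.0 only if the whole grid is exactly correct, else 0.0."""
    return 1.0 if pred == target else 0.0

target = [[1, 0], [0, 1]]
almost = [[1, 0], [0, 0]]  # one cell wrong

print(pixel_accuracy(almost, target))     # 0.75
print(holistic_accuracy(almost, target))  # 0.0
```

A model that is "75% right" on every puzzle would look decent under the first metric while solving nothing under the second, which is the gap MMGR's scoring is designed to expose.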

Why it matters?

This work is important because it shows that current AI models are still far from being true 'world simulators'. They can create impressive visuals, but they lack a fundamental understanding of how things work. By highlighting these weaknesses, the researchers provide a roadmap for improving these models so they can not only *show* us a world, but also *understand* it, which is crucial for building AI that can reliably interact with and reason about the real world.

Abstract

Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.