MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan

2025-07-18

Summary

This paper introduces MindJourney, a system that helps vision-language models reason about 3D spaces by letting them imagine, through generated video, how a scene would look from other viewpoints, even when given only a single image.

What's the problem?

Vision-language models typically see only 2D images, so they struggle to predict how a scene would look from a different angle or after the camera moves. That ability is important for tasks like navigation.

What's the solution?

The authors created a framework in which the vision-language model plans a path for a virtual camera to move through the scene, while a world model generates a video showing what the views along that path would look like. The vision-language model then uses these imagined views as extra evidence when answering spatial reasoning questions. The whole process runs at test time and needs no additional training.
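The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `imagine_and_answer`, the callables `world_model`, `score_view`, and `answer`, the action names, and the `depth` and `keep` parameters are all hypothetical placeholders. The keep-the-best-views search strategy is likewise an assumption chosen for simplicity, since this summary does not say how camera paths are selected.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical placeholder types; a real system would use images and camera poses.
Action = str
Path = Tuple[Action, ...]


@dataclass
class Candidate:
    actions: Path   # camera moves taken from the original viewpoint
    view: Any       # imagined view produced by the world model
    score: float    # how useful the scorer judged this view for the question


def imagine_and_answer(
    image: Any,
    question: str,
    world_model: Callable[[Any, Path], Any],     # assumed: renders a view after a camera path
    score_view: Callable[[str, Any], float],     # assumed: rates a view's usefulness
    answer: Callable[[str, List[Any]], Any],     # assumed: answers from a set of views
    actions: Sequence[Action] = ("forward", "turn_left", "turn_right"),
    depth: int = 2,
    keep: int = 2,
) -> Any:
    """Explore imagined camera paths, keep the most useful views, then answer."""
    frontier: List[Path] = [()]      # start from the original viewpoint
    kept: List[Candidate] = []
    for _ in range(depth):
        candidates: List[Candidate] = []
        for path in frontier:
            for action in actions:
                new_path = path + (action,)
                view = world_model(image, new_path)
                candidates.append(
                    Candidate(new_path, view, score_view(question, view))
                )
        # Keep only the highest-scoring imagined views and expand from them next.
        candidates.sort(key=lambda c: c.score, reverse=True)
        best = candidates[:keep]
        kept.extend(best)
        frontier = [c.actions for c in best]
    # The original image plus the kept imagined views form the evidence set.
    views = [image] + [c.view for c in kept]
    return answer(question, views)
```

No model weights are updated anywhere in this loop, which is what makes it a test-time procedure: the extra compute goes into generating and selecting views rather than into training.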

Why does it matter?

This matters because it gives AI a better understanding of 3D environments without costly extra training, which can improve applications such as robotics, virtual reality, and any other task that requires spatial awareness.

Abstract

MindJourney enhances vision-language models with 3D reasoning by coupling them with a video diffusion-based world model, achieving improved performance on spatial reasoning tasks without fine-tuning.
