World-in-World: World Models in a Closed-Loop World

Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen

2025-10-22

Summary

This paper investigates whether realistic simulated worlds created by AI can actually help robots or 'agents' make good decisions and complete tasks, rather than just looking pretty.

What's the problem?

Currently, we're pretty good at making AI create visually stunning simulated worlds, but it's unclear if these worlds are *useful* for agents trying to navigate and accomplish goals. Existing tests mostly focus on how realistic the visuals are, and don't really check if an agent can actually succeed in a task using the simulated world. It's like judging a flight simulator based on how good the graphics are, instead of whether it actually trains pilots to fly.

What's the solution?

The researchers created a new testing platform called 'World-in-World' where agents interact with these simulated worlds in a closed-loop way: the agent's actions directly change the world, and the world's response shapes the agent's next decision. They built four different environments within this platform and measured whether agents could actually *complete tasks*, not just how good the world looked. They also experimented with different ways to improve the world models, such as post-training on more action-observation data or spending more computing power at inference time.
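The closed-loop idea described above can be illustrated with a toy sketch: the agent imagines the outcome of each candidate action using a world model, picks the best one, and executes it in the real environment, repeating until the task is done. All of the names below (`ToyWorldModel`, `ToyEnv`, `plan`) are hypothetical stand-ins, not the paper's actual API.

```python
class ToyWorldModel:
    """Hypothetical world model: predicts the next observation
    for a candidate action (toy 1-D dynamics)."""
    def predict(self, obs, action):
        return obs + action  # placeholder dynamics

class ToyEnv:
    """Toy 1-D environment: the agent must reach a goal position."""
    def __init__(self, goal=5):
        self.goal, self.obs = goal, 0
    def step(self, action):
        self.obs += action
        return self.obs, self.obs == self.goal

def plan(wm, obs, goal, candidates=(-1, 0, 1)):
    # Score each candidate by imagining its outcome with the world model,
    # then pick the action whose predicted observation is closest to the goal.
    return min(candidates, key=lambda a: abs(wm.predict(obs, a) - goal))

env, wm = ToyEnv(), ToyWorldModel()
obs, done, steps = env.obs, False, 0
while not done and steps < 20:
    action = plan(wm, obs, env.goal)  # imagine with the world model
    obs, done = env.step(action)      # act in the environment (closed loop)
    steps += 1
print("reached goal:", done, "in", steps, "steps")
```

The key point of the closed loop is that planning and acting alternate: a world model with poor controllability would mispredict action outcomes, so the agent's plans would fail regardless of how realistic the predictions look.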

Why it matters?

This work is important because it shows that visual realism isn't the most important thing when building simulated worlds for AI agents. Being able to *control* what happens in the simulation and learning from actions within the simulation are more crucial for success. It also provides insights into how to best improve these world models, suggesting that refining how the AI understands and responds to actions is more effective than just making the visuals more detailed.

Abstract

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.