MultiWorld: Scalable Multi-Agent Multi-View Video World Models
Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
2026-04-21
Summary
This paper introduces a new system called MultiWorld that's designed to predict what will happen next in videos, specifically in scenes where multiple things (like agents or robots) are interacting with each other and are observed from different viewpoints.
What's the problem?
Current video prediction models are really good at figuring out what happens next when *one* thing is doing something, but they struggle when you have multiple things interacting. Imagine trying to predict what happens in a soccer game – it’s way more complicated than predicting what happens when just one person kicks a ball. Existing models also have trouble making sure everything looks consistent when you're looking at the scene from different angles, like multiple cameras.
What's the solution?
The researchers created MultiWorld, which uses two key ideas. First, a 'Multi-Agent Condition Module' helps the system understand and control each individual agent separately. Second, a 'Global State Encoder' makes sure that all the different views of the scene stay consistent with each other. The system can also handle a varying number of agents and viewpoints, and it generates all the views simultaneously to speed things up.
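To make the two ideas concrete, here is a minimal toy sketch of the data flow they describe: per-agent actions are encoded separately (so each agent stays individually controllable), all views are pooled into one shared global state (so views stay consistent), and every view is then predicted from both signals at once. This is an illustrative assumption, not the paper's actual architecture; the function names, the averaging pool, and the "fusion" step are all hypothetical stand-ins for learned neural components.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch, NOT the paper's code: stand-ins for learned modules.

@dataclass
class Frame:
    view_id: int
    features: List[float]  # stand-in for latent video features of one view

def encode_agent_conditions(actions: List[List[float]]) -> List[float]:
    """Toy 'Multi-Agent Condition Module': embed each agent's action
    separately, then concatenate, so agents remain individually controllable."""
    embedding: List[float] = []
    for action in actions:                         # one action vector per agent
        embedding.extend(x * 0.5 for x in action)  # toy per-agent "encoder"
    return embedding

def encode_global_state(frames: List[Frame]) -> List[float]:
    """Toy 'Global State Encoder': pool features across all views into a
    single shared scene summary that every view's prediction conditions on."""
    n = len(frames)
    dim = len(frames[0].features)
    return [sum(f.features[i] for f in frames) / n for i in range(dim)]

def predict_next_views(frames: List[Frame],
                       actions: List[List[float]]) -> List[Frame]:
    """Predict all views 'in parallel' from history, per-agent actions,
    and the shared global state (fusion here is a toy additive bias)."""
    cond = encode_agent_conditions(actions)
    state = encode_global_state(frames)
    bias = sum(cond) + sum(state)  # toy fusion of the two conditioning signals
    return [Frame(f.view_id, [x + bias for x in f.features]) for f in frames]
```

Because the agent list and the view list are just Python lists here, the sketch also mirrors the paper's claim of flexible scaling: you can pass in any number of agents or views without changing the code.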
Why it matters?
This work is important because it allows for more realistic simulations of complex environments. This could be useful for training robots to work together, developing more intelligent video games with multiple players, or even creating better simulations for planning and testing in the real world. Basically, it’s a step towards making AI that can understand and interact with the world as we do, which is full of multiple interacting objects and people.
Abstract
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/