Solaris: Building a Multiplayer Video World Model in Minecraft
Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie
2026-02-26
Summary
This paper introduces Solaris, a new system for creating realistic videos of multiple agents interacting with each other, like in a video game. It is a 'world model,' meaning it tries to learn how the world works and then generate new, believable scenarios.
What's the problem?
Current AI systems that generate action-conditioned videos usually focus on what *one* character does. Real-world environments are full of multiple things happening at once, with agents interacting. Existing models could not handle these complex, multi-agent situations and lacked the data to learn them effectively. They could not show consistent views from different agents or capture how one agent's actions affect the others.
What's the solution?
The researchers built a system to automatically collect a huge amount of video game data (12.64 million frames in total) showing multiple players interacting in Minecraft. The system coordinates the players' actions and records, in sync, both each player's video and the actions they take. They then used this data to train a model, Solaris, in stages, starting with single-player scenarios and gradually increasing the complexity to include multiple agents. A key part of the training is a new technique called 'Checkpointed Self Forcing,' which lets the model learn over longer time horizons without running out of memory.
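The core memory trick behind Checkpointed Self Forcing can be illustrated with gradient checkpointing applied to an autoregressive rollout. The sketch below is a minimal illustration, not the paper's actual implementation: `TinyWorldModel`, `self_forcing_rollout`, and all dimensions are hypothetical stand-ins. The idea shown is that each rollout step's activations are recomputed during the backward pass instead of being stored, so a longer horizon fits in memory.

```python
# Hypothetical sketch of checkpointed self forcing (names and dims are
# illustrative, not the paper's actual code or schema).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyWorldModel(nn.Module):
    # Stand-in one-step predictor: next frame from current frame + action.
    def __init__(self, frame_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Linear(frame_dim + action_dim, frame_dim)

    def forward(self, frame, action):
        return self.net(torch.cat([frame, action], dim=-1))


def self_forcing_rollout(model, frame, actions, use_checkpoint=True):
    # "Self forcing": the model's own prediction is fed back in at every
    # step, so gradients flow through the whole generated trajectory.
    frames = []
    for action in actions:
        if use_checkpoint:
            # Activations for this step are recomputed in the backward
            # pass rather than kept in memory.
            frame = checkpoint(model, frame, action, use_reentrant=False)
        else:
            frame = model(frame, action)
        frames.append(frame)
    return torch.stack(frames)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyWorldModel()
    start = torch.zeros(1, 8)
    actions = torch.randn(16, 1, 2)  # a 16-step action sequence
    rollout = self_forcing_rollout(model, start, actions)
    rollout.sum().backward()  # backward recomputes each step on the fly
    print(rollout.shape)  # torch.Size([16, 1, 8])
```

With checkpointing, memory grows with the number of stored step boundaries rather than with every intermediate activation, which is what allows the longer-horizon teacher described above.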
Why does it matter?
This work is important because it's a step towards creating AI that can understand and predict complex, real-world scenarios involving multiple actors. By open-sourcing their system and models, the researchers hope to encourage further development of 'multi-agent world models,' which could be used for things like training robots, creating more realistic game environments, or even simulating social interactions.
Abstract
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized video + action capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
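The synchronized multi-agent capture the abstract describes can be pictured as one record per world tick that bundles every agent's view with the action it took. The sketch below is a hypothetical illustration: `MultiplayerTick`, `AgentStep`, and all field names are assumptions, not the paper's actual data schema.

```python
# Hypothetical record for one synchronized multiplayer sample: a single
# shared world tick aligns every agent's frame and action. Names are
# illustrative, not the paper's actual schema.
from dataclasses import dataclass, field


@dataclass
class AgentStep:
    agent_id: str
    frame: bytes   # encoded first-person frame for this agent at this tick
    action: dict   # e.g. {"move": "forward", "yaw": 12.5, "attack": False}


@dataclass
class MultiplayerTick:
    tick: int                                  # shared world clock
    steps: list[AgentStep] = field(default_factory=list)

    def agents(self) -> list[str]:
        # Which agents contributed a synchronized view at this tick.
        return [s.agent_id for s in self.steps]


# Usage: two agents recorded at the same tick share one record, so their
# views and actions stay aligned for multi-view training.
sample = MultiplayerTick(tick=42, steps=[
    AgentStep("alice", b"...", {"move": "forward"}),
    AgentStep("bob", b"...", {"move": "left"}),
])
print(sample.agents())  # ['alice', 'bob']
```

Keying every agent's observation and action to the same tick is what makes cross-view consistency checkable later: any two views at the same tick are supposed to depict the same world state.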