Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li
2026-04-13
Summary
This paper introduces Matrix-Game 3.0, an interactive world model that generates long, consistent 720p video in real time. It builds on Matrix-Game 2.0 with improvements to the training data, the model, and inference.
What's the problem?
Current AI video generation struggles with two main issues: remembering details over long videos and creating high-resolution videos quickly enough for real-time use. Imagine trying to make a movie where characters and objects stay consistent throughout, while also making it happen live – that’s the challenge. Existing systems either can’t maintain consistency over time or can’t generate videos fast enough at a good resolution.
What's the solution?
The researchers made improvements in three areas. First, they built a large-scale dataset of videos paired with camera poses, actions, and text descriptions, drawing on Unreal Engine scenes, existing AAA games, and real-world footage. Second, they trained the model to correct its own mistakes by feeding its imperfect generated frames back in during training, and they gave it a camera-aware 'memory' that retrieves relevant earlier content so that scenes stay consistent over long videos. Finally, they made the model fast enough for real-time use by distilling it, quantizing it, and pruning the video decoder.
Why it matters?
This work is important because it brings us closer to creating practical, AI-powered 'world models' – essentially, AI that can simulate and interact with realistic environments. This has huge potential for things like video game development, virtual reality, and even creating training simulations, making these technologies more accessible and realistic.
Abstract
With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
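Illustrative sketch
The abstract names camera-aware memory retrieval but does not say how stored frames are scored or selected. The Python sketch below is one plausible reading, not the authors' method: it assumes each remembered frame is stored with a camera position and a unit view direction, and it ranks stored frames by how closely their viewpoint matches the current camera. The names MemoryEntry, view_score, retrieve, and the pos_scale parameter are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """A remembered frame tagged with the camera pose it was generated from.

    Hypothetical structure: the abstract does not describe the memory layout.
    """
    frame_id: int
    position: tuple[float, float, float]  # camera position in world coordinates
    forward: tuple[float, float, float]   # unit-length view direction


def view_score(query: MemoryEntry, entry: MemoryEntry, pos_scale: float = 5.0) -> float:
    """Score how similar a stored viewpoint is to the query viewpoint.

    Assumed heuristic: cosine similarity of the view directions, minus a
    penalty that grows with the distance between camera positions.
    """
    distance = math.dist(query.position, entry.position)
    cos_sim = sum(a * b for a, b in zip(query.forward, entry.forward))
    return cos_sim - distance / pos_scale


def retrieve(query: MemoryEntry, memory: list[MemoryEntry], k: int = 2) -> list[MemoryEntry]:
    """Return the k stored frames whose camera pose best matches the query."""
    ranked = sorted(memory, key=lambda e: view_score(query, e), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    memory = [
        MemoryEntry(0, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
        MemoryEntry(1, (10.0, 0.0, 0.0), (1.0, 0.0, 0.0)),   # same direction, far away
        MemoryEntry(2, (0.5, 0.0, 0.0), (-1.0, 0.0, 0.0)),   # nearby, facing the other way
        MemoryEntry(3, (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ]
    current = MemoryEntry(-1, (0.2, 0.0, 0.0), (1.0, 0.0, 0.0))
    print([e.frame_id for e in retrieve(current, memory, k=2)])
```

In this toy setup, the frames selected are the ones captured close to the current camera and facing the same way, which is the intuition behind conditioning generation on previously seen views when the camera returns to a location. How the retrieved frames are injected into the model is not covered here.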