Yume: An Interactive World Generation Model

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang

2025-07-24

Summary

This paper introduces Yume, a system that creates interactive, high-quality video worlds from images. It is built around a special kind of AI called a Masked Video Diffusion Transformer and paired with techniques that make video generation faster and better.

What's the problem?

Making realistic, explorable video worlds from images is very hard: video is complex, and existing methods are usually slow, not interactive, or not high-fidelity.

What's the solution?

The researchers built Yume around a Masked Video Diffusion Transformer, which learns about video by predicting parts of frames that have been hidden (masked). They combined this with advanced sampling techniques and model acceleration so the generated worlds can be explored interactively.
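To build intuition for the masking idea, here is a toy sketch (not the paper's actual architecture or code): a "video" is just a list of small frames, some frames are hidden, and a stand-in "model" fills them back in. The real system uses a learned diffusion transformer to denoise and predict the hidden content; the pixel-wise average below is only a placeholder.

```python
import random

# Toy "video": 8 frames, each just 4 pixel values (hypothetical data).
random.seed(0)
frames = [[random.random() for _ in range(4)] for _ in range(8)]

# Hide some frames (True = masked), as masked training objectives do.
mask = [True, False, True, False, False, True, False, False]

visible = [f for f, m in zip(frames, mask) if not m]

# Stand-in "model": fill each hidden frame with the pixel-wise mean of
# the visible frames. The real Masked Video Diffusion Transformer would
# instead learn to predict the missing frames via denoising.
mean_frame = [sum(px) / len(visible) for px in zip(*visible)]
reconstructed = [mean_frame if m else f for f, m in zip(frames, mask)]
```

The key point is the training signal: by repeatedly hiding parts of a video and being asked to restore them, the model is forced to learn how frames relate over space and time.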

Why it matters?

This matters because Yume could help create virtual environments for games, movies, and simulations that both look real and respond to user actions, making those experiences more immersive and engaging.

Abstract

A framework for generating and exploring interactive, high-fidelity video worlds from images using a Masked Video Diffusion Transformer, advanced sampling techniques, and model acceleration.