Yume-1.5: A Text-Controlled Interactive World Generation Model
Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
2025-12-30
Summary
This paper introduces a new system, called Yume-1.5, for creating realistic and interactive virtual worlds that you can explore, starting from just a single image or a text description.
What's the problem?
Existing methods for generating these kinds of worlds using diffusion models have some major drawbacks. They often require huge amounts of computing power, take a long time to create each frame, and struggle to keep track of everything happening in the world as you explore it. Plus, they usually can't easily respond to specific text commands to change what's going on.
What's the solution?
Yume-1.5 tackles these issues with three main ideas. First, it compresses the world's history into a compact representation so generation doesn't slow down as you explore. Second, it speeds up the creation of each frame using bidirectional attention distillation and an improved text embedding scheme. Finally, it lets you use text prompts to trigger specific events within the generated world, giving you more control.
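To make the first idea concrete, here is a minimal, generic sketch of linear attention, the mechanism the paper pairs with its context compression. This is not the paper's actual implementation; the feature map (`elu(x) + 1`) and function names are illustrative assumptions. The key point is that by applying a feature map and reassociating the matrix products, the history can be summarized in a fixed-size matrix, so cost grows linearly rather than quadratically with context length.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a common positivity-preserving choice in linear attention
    # (an assumption here, not necessarily what the paper uses)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n) attention via reassociation: phi(Q) @ (phi(K)^T V).

    Q, K: (n, d); V: (n, d_v). The (d, d_v) matrix `kv` is a fixed-size
    summary of the entire history, regardless of how many frames n spans.
    """
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                        # (d, d_v) compressed history
    z = Kf.sum(axis=0)                   # (d,) running normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]
```

Because matrix multiplication is associative, this produces exactly the same output as the naive quadratic form `(phi(Q) phi(K)^T) V` with row normalization, while never materializing the n-by-n attention matrix.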
Why it matters?
This work is important because it makes creating and exploring these interactive worlds much more practical. By reducing the computational demands and adding text control, it opens the door to real-time experiences and more dynamic, user-driven virtual environments.
Abstract
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges, such as excessively large parameter counts, reliance on lengthy inference steps, and rapidly growing historical context, that severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We provide the codebase in the supplementary material.