
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo

2025-12-17


Summary

This paper introduces WorldPlay, a new AI system that generates interactive 3D worlds as streaming video in real time, so you can explore and change these worlds while they are being created.

What's the problem?

Existing methods for creating these interactive 3D worlds face a challenge: they either run quickly but forget details from earlier in the video, leading to inconsistencies, or they remember everything but are too slow to be truly interactive. Basically, it's hard to balance speed and accuracy when building a 3D world from a video stream.

What's the solution?

The creators of WorldPlay tackled this problem with three main ideas. First, they developed a way for the program to understand and respond to user actions like mouse clicks and keyboard presses. Second, they created a 'memory' system that constantly rebuilds the context of the scene, focusing on important past frames to prevent details from fading over time. Finally, they used a technique called 'Context Forcing' to train a faster version of the program without losing its ability to remember long-term details, ensuring it stays accurate and doesn't drift into errors.
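The "memory" idea above can be illustrated with a small sketch. This is not the paper's actual algorithm, just an assumed, simplified version of what "rebuilding context from important past frames" could look like: the context window is reassembled each step from the most recent frames plus a few long-past frames ranked by a (hypothetical) geometric-importance score.

```python
# Hedged sketch of a "reconstituted context memory".
# The function name, parameters, and the importance scoring are
# illustrative assumptions, not the paper's actual method.

def reconstitute_context(frames, importance, n_recent=4, n_anchor=2):
    """Rebuild the context window: keep the last n_recent frames plus the
    n_anchor highest-importance older frames (e.g. frames showing geometry
    the camera may revisit), so long-past details stay accessible."""
    recent = frames[-n_recent:]
    older = frames[:-n_recent]
    # Rank older frames by their assumed geometric-importance score.
    anchors = sorted(older, key=lambda f: importance[f], reverse=True)[:n_anchor]
    # Restore temporal order so the model sees a coherent history.
    return sorted(anchors) + recent

frames = list(range(12))  # frame indices 0..11
importance = {f: (1.0 if f in (0, 5) else 0.1) for f in frames}
print(reconstitute_context(frames, importance))  # -> [0, 5, 8, 9, 10, 11]
```

The point of the sketch is the contrast with a plain sliding window: a sliding window would have dropped frames 0 and 5 entirely, while the rebuilt context keeps them alongside the newest frames.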

Why it matters?

WorldPlay is important because it allows for the creation of detailed and consistent 3D worlds in real-time. This opens up possibilities for things like interactive movies, virtual reality experiences, and even tools for designers and artists, all while requiring less computing power than previous methods.

Abstract

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. By aligning memory context between the teacher and student, it preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.
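The distillation idea in point 3 can be made concrete with a toy sketch. Everything here is an illustrative assumption: the "teacher" and "student" are stand-in linear functions rather than diffusion models, and the point is only that the student is trained against the teacher while both condition on the *same* memory context, so the student keeps using that long-range information.

```python
# Toy sketch of context-aligned distillation ("Context Forcing" idea).
# Models and gradients here are illustrative stand-ins, not the paper's.

def teacher(latent, context):
    # Stand-in for a slow, many-step teacher that blends the current
    # latent with information from the shared memory context.
    return 0.5 * latent + 0.5 * sum(context) / len(context)

def student(latent, context, w):
    # Stand-in for a fast few-step student with one learnable weight w,
    # conditioned on the SAME context as the teacher.
    return w * latent + (1 - w) * sum(context) / len(context)

def distill_step(latent, context, w, lr=0.1):
    """One gradient step pulling the student toward the teacher's output
    on a shared memory context (half-gradient of the squared error)."""
    err = student(latent, context, w) - teacher(latent, context)
    grad = err * (latent - sum(context) / len(context))
    return w - lr * grad

w = 1.0
latent, context = 2.0, [0.0, 2.0]  # mean(context) = 1.0
for _ in range(200):
    w = distill_step(latent, context, w)
print(round(w, 3))  # converges toward the teacher's weight, 0.5
```

Because both functions see an identical context, the student's error reduces to a single mismatched weight, which the distillation loop drives to the teacher's value; with mismatched contexts, no such alignment would be possible, which is the intuition behind aligning memory between teacher and student.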