WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou
2026-03-18
Summary
This paper focuses on improving how well computer-generated gaming worlds respond to player actions and maintain a consistent 3D environment over long periods of gameplay.
What's the problem?
Current interactive gaming world models, which use advanced AI to create game environments, have trouble with two main things: accurately translating player commands into precise movements within the game, and keeping the 3D world visually consistent when players explore for a long time. They often treat player actions as simple instructions instead of recognizing how those actions physically change the camera's position and orientation in the 3D space, leading to jerky movements or unrealistic scenes.
What's the solution?
The researchers tackled this by treating the camera's position and angle as the central element. They defined a physics-based, continuous space of player actions and represented each action mathematically as a small rigid-body motion, so the camera's exact new position and orientation can be computed after every input. This pose information is then fed into the AI model to ensure actions are carried out accurately. Additionally, they used the accumulated global camera pose as a spatial index to remember where the player has been, letting the AI retrieve earlier views and keep environments consistent when players revisit previously explored areas. To support this, they also built a large dataset of real gameplay footage with detailed camera tracking.
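The core geometric step, turning a continuous action into an exact camera motion, can be sketched with the standard exponential map from the Lie algebra se(3) to a 6-DoF pose. This is a minimal illustration of the general technique, not the paper's implementation; the twist layout and function names are assumptions.

```python
import numpy as np

def hat(w):
    # Skew-symmetric matrix [w]_x such that hat(w) @ v == np.cross(w, v).
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    # xi = (v, omega): linear and angular velocity integrated over one time step.
    # Returns a 4x4 homogeneous transform (the relative camera motion).
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        R = np.eye(3) + W      # first-order approximation for tiny rotations
        V = np.eye(3)
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' rotation formula
        V = np.eye(3) + B * W + C * (W @ W)   # maps v to the translation part
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def accumulate(twists):
    # Compose per-step relative motions into one global camera pose,
    # mirroring how per-action motions accumulate during gameplay.
    T = np.eye(4)
    for xi in twists:
        T = T @ se3_exp(xi)
    return T
```

For example, two "move forward along x" actions accumulate into a translation of 2 along x, while a pure yaw twist of pi/2 rotates the camera's x-axis onto its y-axis.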
Why it matters?
This work is important because it makes generated gaming worlds more immersive and believable. By improving action control and 3D consistency, the AI can create environments that feel more responsive and realistic, leading to a better gaming experience. It pushes the field closer to creating truly interactive and expansive virtual worlds.
Abstract
Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
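The second idea in the abstract, using global camera poses as spatial indices to retrieve relevant past observations, can be illustrated with a simple nearest-pose lookup. This is a hedged sketch of the retrieval principle under assumed names and a hand-picked distance metric, not the paper's actual memory mechanism.

```python
import numpy as np

def pose_distance(T_a, T_b, rot_weight=1.0):
    # Combine translation distance with the angle between the two orientations.
    d_t = np.linalg.norm(T_a[:3, 3] - T_b[:3, 3])
    # Geodesic angle of the relative rotation, from its trace.
    R_rel = T_a[:3, :3].T @ T_b[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return d_t + rot_weight * np.arccos(cos_angle)

def retrieve(memory, T_query, k=4):
    # memory: list of (global_pose, frame) pairs stored during generation.
    # Returns the k frames whose recorded poses are closest to the query pose,
    # so the model can condition on views of a location it is revisiting.
    scored = sorted(memory, key=lambda entry: pose_distance(entry[0], T_query))
    return [frame for _, frame in scored[:k]]
```

In practice a structure like a k-d tree over pose translations would replace the linear scan, but the principle is the same: geometric proximity in pose space, not recency, decides which past observations condition the next frame.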