Stereo World Model: Camera-Guided Stereo Video Generation
Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi
2026-03-19
Summary
This paper introduces StereoWorld, a new system for generating realistic stereo videos – videos that give a 3D effect – from regular color (RGB) video alone. It learns what things look like and their 3D shape at the same time, grounding the geometry directly in how the images differ between the two 'eyes' (much like our own vision).
What's the problem?
3D videos are usually created either by starting from a single video and *guessing* the depth, or by feeding depth information in alongside the color. Guessed depth isn't very accurate, while extra depth data requires additional processing and isn't always available. Existing methods also struggle to keep the two eyes' views consistent with each other, and generation can be slow.
What's the solution?
StereoWorld solves this with an approach built around the relationship between the two views of a stereo pair. It uses 'camera-aware positional encoding' to tell the model where the camera is and how it is moving, which keeps the generated views consistent across viewpoints and over time. It also breaks the expensive job of analyzing the entire video at once into smaller, more manageable parts: each eye's view is analyzed on its own, and the two views are then compared only along matching horizontal rows, since how things line up horizontally between the views is the key clue for depth. This makes generation much faster and more efficient; a rough sketch of the idea appears below.
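To make the factorization concrete, here is a minimal PyTorch sketch (not the authors' code) of splitting full 4D attention into per-view 3D attention plus horizontal row attention across views. The tensor layout, the single attention head, and the additive combination of the two passes are all simplifying assumptions.

```python
# Minimal sketch (not the authors' code) of the stereo-aware attention
# decomposition. Assumes rectified stereo latents shaped
# (batch, views=2, time, height, width, channels), a single attention
# head, and an additive combination of the two passes -- all of which
# are illustrative simplifications.
import torch
import torch.nn.functional as F

def stereo_attention(q, k, v):
    # q, k, v: (B, V, T, H, W, C) with V = 2 stereo views.
    B, V, T, H, W, C = q.shape

    # (1) Intra-view 3D attention: each view attends over its own
    #     spatio-temporal tokens (T*H*W), like a monocular video model.
    def flat(x):  # -> (B*V, heads=1, T*H*W, C)
        return x.reshape(B * V, 1, T * H * W, C)

    intra = F.scaled_dot_product_attention(flat(q), flat(k), flat(v))
    intra = intra.reshape(B, V, T, H, W, C)

    # (2) Horizontal row attention: for each frame and image row, tokens
    #     attend across both views along the width axis. In a rectified
    #     stereo pair, corresponding points share a row (the epipolar
    #     prior), so this cheap 1D pass can align disparity.
    def rows(x):  # -> (B*T*H, heads=1, V*W, C)
        return (x.permute(0, 2, 3, 1, 4, 5)   # (B, T, H, V, W, C)
                 .reshape(B * T * H, 1, V * W, C))

    cross = F.scaled_dot_product_attention(rows(q), rows(k), rows(v))
    cross = (cross.reshape(B, T, H, V, W, C)
                  .permute(0, 3, 1, 2, 4, 5))  # back to (B, V, T, H, W, C)

    # Summing the two outputs is one simple way to compose the factors;
    # the paper's exact composition may differ.
    return intra + cross
```

Relative to full attention over all V·T·H·W tokens, the factored passes attend over much shorter sequences (T·H·W within a view and V·W within a row), which is where the compute savings come from.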
Why it matters?
This work matters because it enables high-quality 3D video to be generated directly from standard 2D video, without any extra depth information. That opens up possibilities like building virtual reality experiences, helping robots understand their surroundings, and producing longer, more interactive 3D content more easily and quickly.
Abstract
We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
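The abstract's first design, the unified camera-frame RoPE, can be pictured as standard rotary encoding whose rotation angles come from per-token coordinates expressed in a shared camera frame rather than raw grid indices. Below is a minimal sketch under that reading; the function name camera_frame_rope, the three-axis channel split, and the assumption that each token carries an (x, y, z) position derived from the known stereo extrinsics are all illustrative, not the authors' implementation.

```python
# Minimal sketch of rotary positional encoding (RoPE) driven by
# camera-frame coordinates -- one plausible reading of the "unified
# camera-frame RoPE", not the authors' implementation.
import torch

def camera_frame_rope(x, positions, base=10000.0):
    # x:         (..., N, C) query or key features, C divisible by 6
    # positions: (..., N, 3) per-token (x, y, z) in the shared camera frame
    C = x.shape[-1]
    per_axis = C // 3                 # channels allotted to each spatial axis
    half = per_axis // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)

    out = []
    for axis in range(3):
        xa = x[..., axis * per_axis:(axis + 1) * per_axis]
        ang = positions[..., axis:axis + 1] * freqs   # (..., N, half)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = xa[..., :half], xa[..., half:]
        # Standard RoPE: rotate channel pairs by an angle linear in position.
        out.append(torch.cat([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], dim=-1))
    return torch.cat(out, dim=-1)
```

Because each rotation angle is linear in position, the dot product between two encoded tokens depends only on their positional difference, which is what makes such conditioning relative and consistent across views and time.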