StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
Ke Xing, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Xiaojie Jin, Yao Zhao, Yunchao Wei
2025-12-11
Summary
This paper introduces StereoWorld, a new system for creating realistic 3D videos from regular 2D videos, aiming to make high-quality virtual and augmented reality experiences more accessible.
What's the problem?
Currently, making good 3D videos for things like VR and AR headsets is expensive and often results in videos that don't look quite right, with noticeable flaws or distortions. It's hard to automatically turn a normal video into a convincing 3D experience.
What's the solution?
The researchers adapted a pre-existing AI model that is already good at generating videos, repurposing it to produce a second view of each scene and so create the 3D effect. They trained it on a large dataset of real 3D videos (over 11 million frames, aligned to the natural distance between human eyes) and added a geometry-aware technique that keeps the generated view structurally consistent with the original video. They also split the video into smaller spatial and temporal tiles, making the process faster and enabling higher resolutions.
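The geometry-matching idea can be illustrated with a toy consistency check: if the generated right view really agrees with the scene's geometry, warping it back to the left view using per-pixel disparity should reproduce the original frame. The sketch below is a minimal stand-in, not the paper's actual formulation; the function names and the simple L1 penalty are assumptions for illustration.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Warp a right-view frame to the left view by shifting each pixel
    horizontally by its (rounded) disparity -- a hypothetical, simplified
    stereo warping step."""
    h, w = right.shape[:2]
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    src = np.clip(xs - np.round(disparity).astype(int), 0, w - 1)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return right[rows, src]

def geometry_consistency_loss(left, right, disparity):
    """L1 difference between the left frame and the disparity-warped
    right frame; zero when the two views agree with the geometry.
    This stands in for the paper's geometry-aware regularization,
    whose exact form is not reproduced here."""
    warped = warp_right_to_left(right, disparity)
    return float(np.abs(left - warped).mean())
```

During training, a term like this (computed on predicted depth or disparity) would be added to the usual generation loss, pushing the model toward structurally sound stereo pairs.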
Why it matters?
This work is important because it offers a way to create high-quality 3D videos far more easily and cheaply than current methods. That could significantly lower the barrier to producing immersive content for virtual and augmented reality, helping such experiences become more widespread and accessible.
Abstract
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
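The spatio-temporal tiling idea mentioned in the abstract can be sketched as splitting a video tensor into overlapping tiles along time and space, processing each tile independently, and average-blending the results back together. The tile sizes, overlaps, and blending below are assumptions for illustration, not the paper's actual scheme.

```python
import numpy as np

def tile_video(video, tile=(8, 64, 64), overlap=(2, 16, 16)):
    """Split a (T, H, W, C) video into overlapping spatio-temporal tiles.
    Hypothetical tile/overlap sizes; returns the tiles and their
    (t, y, x) start coordinates."""
    T, H, W, _ = video.shape
    strides = tuple(t - o for t, o in zip(tile, overlap))
    tiles, coords = [], []
    for t0 in range(0, max(T - overlap[0], 1), strides[0]):
        for y0 in range(0, max(H - overlap[1], 1), strides[1]):
            for x0 in range(0, max(W - overlap[2], 1), strides[2]):
                tiles.append(video[t0:t0 + tile[0],
                                   y0:y0 + tile[1],
                                   x0:x0 + tile[2]])
                coords.append((t0, y0, x0))
    return tiles, coords

def stitch_tiles(tiles, coords, shape):
    """Reassemble (possibly processed) tiles into a full video,
    averaging wherever tiles overlap."""
    out = np.zeros(shape)
    weight = np.zeros(shape)
    for tile, (t0, y0, x0) in zip(tiles, coords):
        t, h, w = tile.shape[:3]
        out[t0:t0 + t, y0:y0 + h, x0:x0 + w] += tile
        weight[t0:t0 + t, y0:y0 + h, x0:x0 + w] += 1
    return out / np.maximum(weight, 1)
```

Each tile fits in memory on its own, so the generator can run at resolutions and clip lengths the full video would not allow; identity round-tripping (tile, then stitch unmodified tiles) recovers the original video exactly.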