StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen
2025-12-19
Summary
This paper focuses on improving how we automatically create 3D videos from regular 2D videos, a process called monocular-to-stereo conversion.
What's the problem?
Currently, making 3D videos is expensive and difficult. Existing automatic methods rely on a chain of steps – estimating depth, warping the image accordingly, and then filling in (inpainting) the missing regions – but errors from each stage accumulate in the next, the process struggles when depth information is ambiguous, and it doesn't transfer well between different stereo formats, such as parallel and converged camera setups.
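The three-stage pipeline described above can be sketched in a few lines. This is a minimal, illustrative version (the function name, disparity scaling, and naive hole-filling are all assumptions for the sketch, not the paper's implementation): depth becomes a per-pixel horizontal shift, pixels are forward-warped into the new view, and the disocclusion holes left behind are inpainted.

```python
import numpy as np

def dwi_convert(left, depth, max_disparity=16):
    """Illustrative Depth-Warp-Inpaint (DWI) sketch, not the paper's code:
    shift each pixel horizontally by a depth-derived disparity, then fill
    the holes ('disocclusions') left behind."""
    h, w = depth.shape
    # 1. Depth -> disparity: closer pixels shift more (scaling is illustrative)
    disparity = (max_disparity * (1.0 - depth / depth.max())).astype(int)
    # 2. Warp: forward-project each left-view pixel into the right view
    right = np.zeros_like(left)
    hole = np.ones((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            nx = x - disparity[y, x]
            if 0 <= nx < w:
                right[y, nx] = left[y, x]
                hole[y, nx] = False
    # 3. Inpaint: naive fill of holes from the left neighbor
    for y in range(h):
        for x in range(1, w):
            if hole[y, x]:
                right[y, x] = right[y, x - 1]
    return right, hole
```

Note how step 3 must invent content for pixels that were never observed; any mistake in step 1 shifts the wrong pixels in step 2, which is exactly the error propagation the paper identifies.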
What's the solution?
The researchers created UniStereo, a large new dataset of stereo video pairs covering both parallel and converged formats. They then developed StereoPilot, a model that directly synthesizes the second view in a single forward pass, without first estimating the scene's depth. A learnable domain switcher lets StereoPilot adapt to either stereo format, and a cycle consistency check – converting the generated view back and comparing it with the original – keeps the resulting 3D video consistent.
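The cycle consistency loss mentioned in the abstract can be illustrated with a short sketch. This is a hypothetical formulation (the function and mode names are assumptions, not the paper's API): the model maps the left view to a right view, maps that prediction back, and is penalized for any drift from the original frame.

```python
import numpy as np

def cycle_consistency_loss(model, left, mode_lr="l2r", mode_rl="r2l"):
    """Hypothetical cycle-consistency sketch (names assumed): synthesize the
    right view from the left, map it back to the left view, and penalize
    drift from the original frame with an L1 error."""
    right_pred = model(left, mode=mode_lr)        # left -> predicted right
    left_cycle = model(right_pred, mode=mode_rl)  # predicted right -> back to left
    return float(np.mean(np.abs(left_cycle - left)))
```

An identity model scores a loss of zero; training with this term pushes the left-to-right and right-to-left mappings to be inverses of each other, which is one way a model can "check its own work" without ground-truth depth.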
Why it matters?
This work matters because it makes creating high-quality 3D videos easier and faster. StereoPilot outperforms previous methods, producing more realistic stereo footage while running more efficiently, which is crucial for applications like virtual reality and 3D cinema.
Abstract
The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.