StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi, Konrad Schindler
2025-12-12
Summary
This paper presents a new technique called StereoSpace that creates realistic stereo image pairs (the kind that give a 3D effect) from a single image, without needing to estimate depth or explicitly warp the picture to simulate a new viewpoint.
What's the problem?
Creating 3D images from a single 2D image is tricky. Existing methods usually first estimate how far away each object is (depth estimation) and then warp the image to simulate a second viewpoint. This pipeline can introduce errors and blurriness, especially in complex scenes with transparent layers or shiny surfaces. It is also hard to compare these methods fairly, because some of them 'cheat' by using ground-truth 3D information during testing, which is unrealistic in practice.
What's the solution?
StereoSpace takes a different approach. Instead of estimating depth, it directly learns how the image should change for a desired viewpoint. It uses a process called 'diffusion,' which gradually refines pure noise into the second image of the stereo pair, guided by a 'conditioning' signal that encodes the target viewpoint; working in a fixed, rectified 'canonical space' lets the model find correspondences and fill in newly revealed regions on its own. Importantly, the researchers also created an evaluation protocol that forbids the use of any ground-truth 3D information at test time, making comparisons more honest, and they rely on metrics that measure how comfortable the resulting stereo image is to view (iSQoE) and how geometrically consistent it is (MEt3R). A simplified sketch of the viewpoint-conditioned sampling idea follows.
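To make the idea of "viewpoint-conditioned diffusion" concrete, here is a minimal sketch, not the authors' code: a toy denoiser and a DDPM-style sampling loop that generate a right-view latent conditioned on the left view and a viewpoint (baseline) scalar. The class and function names, tensor shapes, noise schedule, and the tiny convolutional network are all illustrative assumptions; the actual system uses a trained diffusion model operating in a canonical rectified space.

```python
# Minimal sketch (assumed, not the paper's implementation): viewpoint-conditioned
# diffusion sampling for monocular-to-stereo synthesis.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Predicts the noise in the right-view latent, conditioned on the left view,
    a viewpoint (baseline) scalar, and the diffusion timestep."""
    def __init__(self, channels=4):
        super().__init__()
        # input: noisy right latent + left latent + 2 conditioning channels
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 2, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x_t, left, baseline, t):
        b, _, h, w = x_t.shape
        cond = torch.cat([
            x_t, left,
            baseline.view(b, 1, 1, 1).expand(b, 1, h, w),  # viewpoint conditioning
            t.view(b, 1, 1, 1).expand(b, 1, h, w),         # timestep conditioning
        ], dim=1)
        return self.net(cond)

@torch.no_grad()
def sample_right_view(denoiser, left, baseline, steps=50):
    """DDPM-style ancestral sampling of the right-view latent given the left view."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(left)                             # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((left.shape[0],), i / steps)
        eps = denoiser(x, left, baseline, t)               # predicted noise
        # posterior mean of x_{t-1} given x_t and the noise estimate
        x = (x - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x

# Usage: synthesize a right-view latent for a batch of left-view latents.
left_latent = torch.randn(1, 4, 32, 32)
right_latent = sample_right_view(ToyDenoiser(), left_latent,
                                 baseline=torch.tensor([0.065]))
```

The key point the sketch illustrates is that no depth map or warping step appears anywhere: the geometry is implied entirely by the conditioning inputs, and the network must learn correspondences and disocclusion filling end to end.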
Why it matters?
StereoSpace shows that high-quality stereo images can be created from a single picture without explicitly calculating depth. This is a significant step forward because it simplifies the pipeline and makes it more robust to challenging scenes. It opens the door to more scalable and efficient ways of generating 3D content, for example for virtual reality, augmented reality, and other applications where realistic 3D experiences matter.
Abstract
We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp & inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.