StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

¹ETH Zürich  ²University of Bologna  ³HUAWEI Bayer Lab
Teaser image
We present StereoSpace, a method for generating stereo from monocular images. Left: Built on a foundational LDM, our framework efficiently leverages learned priors for end-to-end view synthesis. The target baseline in world units acts as conditioning for precise view control. Images featuring the dragon are illustrative examples. Right: Implicit scene understanding allows us to tackle the most complex cases where geometry cues alone are insufficient for novel view synthesis. Best viewed zoomed in. Legend: squares — warping, circles — breaks, lines — bends, arrows — ghosting. StereoSpace consistently outperforms recent monocular competitors, including generative 3DGS models such as Lyra.

Abstract

StereoSpace is a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the viewpoint conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground-truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses methods from the warp-and-inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.
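To make the conditioning concrete, here is a minimal sketch, under our own assumptions, of how a target baseline given in world units could be injected as a dense conditioning signal alongside the image latent. The sinusoidal encoding and all names below are illustrative placeholders, not the exact scheme used by StereoSpace.

# Hypothetical sketch: turning a metric baseline into a conditioning signal.
# The actual encoding in StereoSpace may differ; names here are illustrative.
import torch


def embed_baseline(baseline_m: torch.Tensor, dim: int = 8) -> torch.Tensor:
    """Sinusoidal embedding of the target baseline (in meters), shape (B, dim)."""
    freqs = 2.0 ** torch.arange(dim // 2, dtype=torch.float32)
    angles = baseline_m[:, None] * freqs[None, :]            # (B, dim/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, dim)


def condition_latent(latent: torch.Tensor, baseline_m: torch.Tensor) -> torch.Tensor:
    """Broadcast the baseline embedding spatially and concatenate it with the latent.

    latent:     (B, C, H, W) VAE latent of the source view
    baseline_m: (B,)         desired stereo baseline in world units (meters)
    """
    emb = embed_baseline(baseline_m)                          # (B, dim)
    b, d = emb.shape
    _, _, h, w = latent.shape
    emb_map = emb.view(b, d, 1, 1).expand(b, d, h, w)         # (B, dim, H, W)
    return torch.cat([latent, emb_map], dim=1)                # (B, C+dim, H, W)


# Example: a 6.5 cm (roughly human interocular) baseline for a batch of two latents.
latents = torch.randn(2, 4, 64, 64)
cond = condition_latent(latents, torch.tensor([0.065, 0.065]))
print(cond.shape)  # torch.Size([2, 12, 64, 64])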

Gallery

StereoSpace shows strong generalization on in-the-wild imagery. The examples in this gallery include challenging reflective surfaces that never appear during training, yet the model produces consistent, high-quality results, indicating robust out-of-distribution behavior. We show the outputs in two common stereo formats: anaglyph and side-by-side, using the two components below.

Use anaglyph glasses to watch the results of StereoSpace in stereo:

Use VR or any SBS (side-by-side) compatible viewer to watch the results of StereoSpace in stereo:
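For readers unfamiliar with the two formats, the short sketch below (not part of the StereoSpace code) packs a left/right pair into a red-cyan anaglyph and a side-by-side image:

# Minimal sketch (not StereoSpace code): packing a stereo pair into the two
# display formats shown on this page. Assumes HxWx3 uint8 RGB arrays.
import numpy as np


def to_anaglyph(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Red-cyan anaglyph: red channel from the left view, green/blue from the right."""
    out = right.copy()
    out[..., 0] = left[..., 0]
    return out


def to_side_by_side(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """SBS format: left and right views concatenated along the width axis."""
    return np.concatenate([left, right], axis=1)


left = np.zeros((480, 640, 3), dtype=np.uint8)
right = np.zeros((480, 640, 3), dtype=np.uint8)
print(to_anaglyph(left, right).shape)      # (480, 640, 3)
print(to_side_by_side(left, right).shape)  # (480, 1280, 3)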

Qualitative Comparison with Other Recent Methods

The component below provides a qualitative comparison with several recent methods on the same scenes. You can switch between scenes using the previews at the top and compare methods by selecting them with the buttons inside the main view.


Our Approach

Idea

Unlike recent methods grouped under "warp-and-inpaint", "latent warping", or "warped conditioning", all of which rely on explicit monocular geometry estimates that are hard to obtain and error-prone, we let the foundation diffusion model infer 3D structure implicitly and train stereo generation end-to-end. The resulting model outperforms all of these categories and even surpasses Lyra, a strong generative 3DGS-based competitor. Qualitative examples are shown in the slider above.
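The contrast between the two philosophies can be summarized in the short sketch below; every function is a hypothetical placeholder rather than an actual API of StereoSpace or any baseline.

# Hypothetical contrast of the two families of approaches; all callables are
# placeholders passed in by the caller, not real APIs.

def warp_and_inpaint(left_image, baseline, depth_estimator, warper, inpainter):
    """Classic pipeline: explicit geometry first, generation only fills the holes."""
    depth = depth_estimator(left_image)              # error-prone monocular estimate
    warped, holes = warper(left_image, depth, baseline)
    return inpainter(warped, holes)                  # disocclusions fixed afterwards


def end_to_end(left_image, baseline, diffusion_model):
    """StereoSpace-style: geometry stays implicit inside the generator."""
    return diffusion_model(left_image, baseline)     # one conditional sampling pass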

Contributions

  • Single-image conditional generation of counterpart views, free from explicit geometric shortcuts;
  • An end-to-end training procedure that efficiently transfers the rich, task-agnostic prior of the foundation model to the stereo synthesis task;
  • A novel perception- and geometry-aware evaluation protocol.

Model Architecture for End-to-End View Synthesis

To produce the right view, our model uses a dual U-Net initialized from Stable Diffusion v2.0. The top branch operates on the source-view latent together with the viewpoint condition. The target baseline is encoded similarly and concatenated with the latent code of the counterpart view. Latent- and pixel-space losses supervise end-to-end fine-tuning, during which target-view synthesis leverages source-view features through cross-attention. Red arrows denote operations performed at training time only.

Method
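As a rough illustration of the wiring described above, the toy module below mimics the dual-branch layout with cross-attention from the target branch to the source branch. Shapes, dimensions, and module choices are our own simplifications, not the fine-tuned Stable Diffusion v2.0 architecture.

# Toy sketch of the dual-branch idea (not the released architecture): a source
# branch encodes the conditioning view, and the target branch attends to its
# features via cross-attention while denoising the counterpart-view latent.
import torch
import torch.nn as nn


class ToyBranch(nn.Module):
    """Stand-in for one U-Net branch: a conv stem producing token features."""

    def __init__(self, in_ch: int, dim: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.stem(x)                              # (B, dim, H, W)
        return f.flatten(2).transpose(1, 2)           # (B, H*W, dim) tokens


class ToyDualUNet(nn.Module):
    def __init__(self, latent_ch: int = 4, cond_ch: int = 8, dim: int = 64):
        super().__init__()
        self.source_branch = ToyBranch(latent_ch + cond_ch, dim)  # source latent + viewpoint cond.
        self.target_branch = ToyBranch(latent_ch + cond_ch, dim)  # noisy target latent + cond.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, latent_ch)                     # predicts the target latent

    def forward(self, noisy_target, source, cond):
        src_tokens = self.source_branch(torch.cat([source, cond], dim=1))
        tgt_tokens = self.target_branch(torch.cat([noisy_target, cond], dim=1))
        fused, _ = self.cross_attn(tgt_tokens, src_tokens, src_tokens)  # target attends to source
        b, _, h, w = noisy_target.shape
        return self.head(fused).transpose(1, 2).reshape(b, -1, h, w)


model = ToyDualUNet()
noisy = torch.randn(1, 4, 32, 32)        # noisy latent of the right view
source = torch.randn(1, 4, 32, 32)       # clean latent of the left view
cond = torch.randn(1, 8, 32, 32)         # broadcast baseline / viewpoint conditioning
print(model(noisy, source, cond).shape)  # torch.Size([1, 4, 32, 32])

In the actual model, latent- and pixel-space losses jointly supervise fine-tuning, as noted in the caption above.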

Quantitative Comparison with Other Recent Methods

We compare all methods using two complementary metrics: iSQoE, which reflects stereo viewing comfort and perceptual coherence, and MEt3R, which captures geometric consistency across the rendered views. Interpreted together, these metrics make the performance differences clear. Diffusion-based inpainting methods score worse on both the perceptual and the geometric axis, reflecting their tendency toward smooth but structurally unreliable completions. Warping-based models improve geometry but still struggle in occluded regions, which both metrics reveal. Depth-conditioned GenStereo performs more stably, especially on simpler scenes. Across datasets, StereoSpace is the only method that improves perceptual comfort and geometric consistency simultaneously, indicating that integrating geometry directly into generation in an end-to-end fashion is most effective.
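For clarity, a minimal sketch of the evaluation loop is shown below; compute_isqoe and compute_met3r are placeholder wrappers standing in for the iSQoE and MEt3R implementations, and the actual scoring pipeline may differ in detail.

# Sketch of the end-to-end protocol (placeholder wrappers, not official APIs):
# each method only receives the left image and the target baseline, so no
# ground-truth or proxy geometry can leak into the comparison.
from statistics import mean


def evaluate(method, dataset, compute_isqoe, compute_met3r):
    """Score one monocular-to-stereo method on both axes of the protocol.

    method:        callable (left_image, baseline_m) -> right_image
    dataset:       iterable of (left_image, baseline_m) pairs
    compute_isqoe: callable (left, right) -> float perceptual-comfort score
    compute_met3r: callable (left, right) -> float geometric-consistency score
    """
    isqoe_scores, met3r_scores = [], []
    for left, baseline in dataset:
        right = method(left, baseline)      # end-to-end: no depth maps passed in
        isqoe_scores.append(compute_isqoe(left, right))
        met3r_scores.append(compute_met3r(left, right))
    return mean(isqoe_scores), mean(met3r_scores)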


Refer to the PDF paper linked above for further qualitative and quantitative results and ablation studies.

Citation

@misc{behrens2025stereospace,
  title        = {StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space},
  author       = {Tjark Behrens and Anton Obukhov and Bingxin Ke and Fabio Tosi and Matteo Poggi and Konrad Schindler},
  year         = {2025},
  eprint       = {2512.10959},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2512.10959},
}