Geometry-Aware Monocular-to-Stereo Generation
- Geometry-aware monocular-to-stereo video generation is the process of converting single-view videos into stereo pairs while preserving accurate 3D structure and depth cues.
- Techniques encompass explicit depth warping, implicit generative methods, and hybrid self-supervised frameworks to enforce spatial and temporal consistency.
- Successful models integrate occlusion handling, epipolar constraints, and temporal coherence to minimize artifacts such as flicker, ghosting, and view inconsistency.
Geometry-aware monocular-to-stereo video generation refers to the task of transforming a monocular video sequence into a temporally coherent stereo video pair, such that the generated sequence exhibits both correct binocular disparity and consistent 3D structural cues across time. This field lies at the intersection of multi-view geometry, neural rendering, video synthesis, and deep generative modeling. The objective is not only to infer plausible right (or left) views given a monocular video but also to ensure that the stereo output encodes metric or semimetric scene geometry, yielding perceptually convincing depth and minimizing artifacts such as flicker, ghosting, or view inconsistency.
1. Categories of Geometry-Aware Monocular-to-Stereo Methods
Geometry-aware monocular-to-stereo video generation methods fall into three broad categories:
- Explicit-geometry-based approaches: These rely on estimated depth or disparity maps (often framewise or video-wide) as explicit geometric scaffolds. Depth information is used to warp the input frames, and the resulting occlusions or holes are subsequently filled by dedicated inpainting or synthesis modules. Key representatives include SpatialMe (Zhang et al., 2024), various modern frame-matrix pipelines (SVG (Dai et al., 2024), S²VG (Dai et al., 11 Aug 2025)), and the FMDN pipeline (Wang et al., 2019).
- Implicit geometry or warping-free synthesis: These methods avoid explicit depth estimation and instead leverage strong priors over multi-view transformations learned from stereo or multi-view video datasets, utilizing generative models such as latent diffusion or transformers. Typical examples are Elastic3D (Metzger et al., 16 Dec 2025), StereoPilot (Shen et al., 18 Dec 2025), StereoWorld (Xing et al., 10 Dec 2025), Eye2Eye (Geyer et al., 30 Apr 2025), and the viewpoint-conditioned diffusion in StereoSpace (Behrens et al., 11 Dec 2025).
- Hybrid and self-supervised approaches: These combine explicit geometric reasoning (e.g., forward-backward warping, photometric consistency) with neural generative components, and may use self-supervised proxy tasks or synthetic data generation to bootstrap learning when stereo data is scarce. Notable frameworks include SpatialDreamer (Lv et al., 2024), F3D-Gaus (Wang et al., 12 Jan 2025), ViDAR (Nazarczuk et al., 23 Jun 2025), and Restereo (Huang et al., 6 Jun 2025).
2. Fundamental Geometric Principles and Representations
Most state-of-the-art systems incorporate geometric inductive bias in one or more forms:
- Epipolar Geometry: Constraints arising from the rigid relationship between two views (known or rectified stereo rigs), such that correspondences must lie on epipolar lines. FMDN (Wang et al., 2019) actively constrains optical flow to epipolar lines, reducing the flow search space and enforcing photometric alignment along physically plausible correspondences. Elastic3D's decoder block applies epipolar cross-attention, attending only along scanlines in the corresponding feature maps to ensure stereo consistency (Metzger et al., 16 Dec 2025).
- Depth/Disparity Warping: Depth maps (D(x)) or disparity fields (d(x)) derived from monocular depth estimators or from learned models are used to project each pixel in the input frame into the target viewpoint under known or virtual stereo geometry:
- For a canonical rectified stereo pair, the transformation is often x_R = x_L − d(x), with disparity d(x) = f·B / D(x) for focal length f and stereo baseline B (SpatialMe, S²VG, SVG).
- In full projective settings, the 3D position is reprojected via the right camera's intrinsics and extrinsics (Shvetsova et al., 22 May 2025, Zhang et al., 2024).
- Implicit Multi-view Correspondence: Warping-free synthesis pipelines sidestep explicit geometric computations and rely on end-to-end learned mappings from source to target view, guided by multi-view training data and sometimes conditional tokens encoding viewpoint or disparity. These models often use 3D-aware architecture features; for instance, StereoWorld's transformer blocks permit both temporal and spatial attention across frames and views (Xing et al., 10 Dec 2025).
- Learnable or Canonicalized Viewpoint Parameterizations: StereoSpace injects dense Plücker ray embeddings as conditioning, enabling the network to model arbitrary relative stereo shifts without explicit depth (Behrens et al., 11 Dec 2025). StereoPilot employs a learnable domain switcher distinguishing parallel vs. converged stereo geometries, which is added to the diffusion timestep embedding (Shen et al., 18 Dec 2025).
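The depth-to-disparity warp above can be illustrated on a single scanline. Below is a minimal pure-Python sketch; the focal length `f`, baseline `B`, z-buffer tie-breaking, and None-as-hole convention are illustrative assumptions, not any cited system's implementation:

```python
# Forward-warp one scanline of a left image into the right view using
# depth-derived disparity: d(x) = f * B / D(x), x_right = x_left - d(x).
# Target pixels that no source pixel maps to stay None (disocclusion holes).

def warp_scanline(left, depth, f=1.0, B=1.0):
    width = len(left)
    right = [None] * width          # None marks a hole to be inpainted later
    z_buf = [float("inf")] * width  # keep the nearest surface per target pixel
    for x, (color, D) in enumerate(zip(left, depth)):
        d = f * B / D               # disparity in pixels (illustrative units)
        xr = round(x - d)
        if 0 <= xr < width and D < z_buf[xr]:
            z_buf[xr] = D           # nearer pixels win (occlusion handling)
            right[xr] = color
    return right

# A near object (depth 0.5) shifts farther than the background (depth 5.0),
# opening a disocclusion hole where it used to be.
left  = ["bg", "bg", "fg", "fg", "bg", "bg"]
depth = [5.0, 5.0, 0.5, 0.5, 5.0, 5.0]
print(warp_scanline(left, depth))  # → ['fg', 'fg', None, None, 'bg', 'bg']
```

The holes at the foreground's former position are exactly the regions the inpainting modules of Section 3 must fill.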
3. Model Architectures and Training Strategies
The most effective geometry-aware monocular-to-stereo video generation models exploit both architectural constraints and targeted losses that directly enforce geometric and temporal consistency.
Depth-Warp–Inpaint Pipelines
Frameworks such as SpatialMe (Zhang et al., 2024), S²VG (Dai et al., 11 Aug 2025), and SVG (Dai et al., 2024) follow a modular structure:
- Depth estimation (via fine-tuned monocular depth networks or external models like Depth Anything) for each frame.
- Warping based on estimated depth/disparity and virtual stereo baseline.
- Occlusion mask generation by detecting holes emerging from forward warping.
- Inpainting performed by multi-branch or frame-matrix schemes that use both polygonal geometric heuristics and deep spatio-temporal inpainting.
- Fusion and refinement through mask-based hierarchical units or frame-matrix alternations, ensuring boundary artifacts are suppressed (boundary re-injection).
Losses typically combine occlusion-weighted content reconstruction with perceptual (VGG/LPIPS) terms and, less frequently, adversarial and warping-consistency terms.
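A minimal sketch of the occlusion-weighted reconstruction term described above, on flat pixel lists; the weight `lam_perc` and the stand-in `perceptual` distance (a placeholder for a VGG/LPIPS feature distance) are illustrative assumptions:

```python
# Occlusion-weighted reconstruction loss: pixels hidden in the warped view
# (mask = 0) should not penalize the synthesizer for hallucinated content.

def stereo_loss(pred, target, occ_mask, lam_perc=0.1,
                perceptual=lambda p, t: 0.0):
    visible = sum(occ_mask)
    if visible == 0:
        return lam_perc * perceptual(pred, target)
    l1 = sum(m * abs(p - t) for p, t, m in zip(pred, target, occ_mask))
    return l1 / visible + lam_perc * perceptual(pred, target)

# The large error on the occluded pixel (mask 0) is ignored.
loss = stereo_loss([1.0, 0.5, 9.0], [1.0, 0.0, 0.0], [1, 1, 0])
print(loss)  # 0.25
```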
End-to-End Warping-Free Generative Approaches
Latent diffusion architectures (Elastic3D (Metzger et al., 16 Dec 2025), StereoPilot (Shen et al., 18 Dec 2025), Eye2Eye (Geyer et al., 30 Apr 2025), StereoWorld (Xing et al., 10 Dec 2025)) encode the input video with a frozen or learnable VAE, then transform left-view latents into right-view latents via feed-forward or minimal-step denoising UNets, transformers, or DiT blocks.
- Conditioning mechanisms: inclusion of spatial tokens (e.g., disparity dial in Elastic3D), domain switchers (StereoPilot), frame-concatenation (StereoWorld), or injected Plücker rays (StereoSpace).
- Geometry-aware decoding: Elastic3D introduces a decoder with epipolar cross-attention, copying detail along scanlines.
- Losses: Latent L2/diffusion losses for distillation; RGB/SSIM/LPIPS reconstruction; optional patch/epipolar consistency (Metzger et al., 16 Dec 2025, Xing et al., 10 Dec 2025).
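Elastic3D's epipolar cross-attention restricts each right-view query to the matching left-view scanline. The following is a toy sketch of row-restricted softmax attention using scalar features; all shapes and values are illustrative, and real implementations use multi-head attention over channel vectors:

```python
import math

# Row-restricted ("epipolar") cross-attention for rectified stereo features:
# a query pixel attends only to source features on the same scanline,
# never across rows.

def scanline_attention(q_rows, kv_rows):
    out = []
    for q_row, kv_row in zip(q_rows, kv_rows):  # rows processed independently
        out_row = []
        for q in q_row:
            scores = [q * k for k in kv_row]           # dot product (scalars)
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]   # stable softmax
            Z = sum(exps)
            out_row.append(sum(w / Z * v for w, v in zip(exps, kv_row)))
        out.append(out_row)
    return out

q  = [[1.0, 2.0], [0.5, 0.5]]
kv = [[1.0, 3.0], [4.0, 4.0]]
att = scanline_attention(q, kv)
print(att[1])  # [4.0, 4.0] -- row 1 only ever mixes row-1 values
```

Because attention weights never leave the scanline, copied detail obeys the epipolar constraint by construction.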
Self-supervised and Fusion Models
Hybrid methods utilize both pseudo-stereo data synthesis (SpatialDreamer DVG module (Lv et al., 2024)) and cycle-aggregative training (F3D-Gaus (Wang et al., 12 Jan 2025)) to enforce cross-view consistency without dense stereo datasets.
- Forward-backward rendering with depth and optical flow for mask smoothing.
- Cycle consistency constraints: recurrently encode synthesized novel views and aggregate new Gaussian primitives.
- Video-specific consistency modules: e.g., Temporal Interaction Learning in SpatialDreamer, frame-matrix denoising coupled with boundary re-injection in SVG/S²VG.
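The cycle-consistency constraint in the bullets above can be sketched on a single scanline: warp left to right with disparity d, warp back with −d, and require non-occluded pixels to round-trip. The integer-disparity shift warp and rigid-shift example are illustrative assumptions:

```python
# Cycle-consistency: left -> right -> left should reproduce the original
# wherever the round trip stays inside the frame.

def shift(img, d):
    # integer-disparity warp on a scanline; unmapped pixels become None
    w = len(img)
    out = [None] * w
    for x, v in enumerate(img):
        if v is not None and 0 <= x - d < w:
            out[x - d] = v
    return out

def cycle_loss(left, d):
    back = shift(shift(left, d), -d)          # left -> right -> left
    pairs = [(a, b) for a, b in zip(left, back) if b is not None]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

print(cycle_loss([0.1, 0.2, 0.3, 0.4], d=1))  # 0.0 for a rigid shift
```

Pixels whose round trip fails (the None entries) are exactly those that a self-supervised method must exclude from, or down-weight in, the consistency loss.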
4. Handling Occlusions, Disocclusions, and Temporal Consistency
The distinguishing challenge of video over single-frame conversion is the requirement for temporally consistent occlusion/disocclusion handling:
- Occlusion masks are generated from warping steps and are then refined via optical flow or temporal mask propagation (SpatialDreamer (Lv et al., 2024)), forward-backward depth checks (FMDN (Wang et al., 2019)), and explicit mask passing to the inpainting module (M2SVid (Shvetsova et al., 22 May 2025)).
- Frame-matrix or sequence-based inpainting ensures that content synthesis is coherent both across frames and spatial views (SVG/S²VG).
- Temporal diffusion or attention: Methods such as StereoWorld (Xing et al., 10 Dec 2025) and Elastic3D (Metzger et al., 16 Dec 2025) incorporate spatio-temporal blocks or attention to propagate appearance and disparity context across the video volume.
- Consistency control modules: E.g., Temporal Interaction Learning in SpatialDreamer, which cross-attends to neighboring frames’ features for spatio-temporal alignment.
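The forward-backward checks mentioned above can be sketched in one dimension: a pixel is flagged occluded when following the forward flow and then the backward flow does not return to the start. The integer flows and zero tolerance are illustrative assumptions:

```python
# Forward-backward consistency check producing a binary occlusion mask.
# fwd[x] is the forward displacement of pixel x; bwd is the reverse field.

def occlusion_mask(fwd, bwd, tol=0):
    mask = []
    for x, f in enumerate(fwd):
        xt = x + f                   # landing position in the other view/frame
        if 0 <= xt < len(bwd):
            # occluded if the round trip does not cancel out
            mask.append(int(abs(f + bwd[xt]) > tol))
        else:
            mask.append(1)           # mapped outside the frame: occluded
    return mask

# Pixel 3 fails the round trip and is flagged for inpainting.
fwd = [1, 1, 1, 0]
bwd = [0, -1, -1, -1]
print(occlusion_mask(fwd, bwd))  # → [0, 0, 0, 1]
```

Masks like this are what get refined by optical flow or temporal propagation before being handed to the inpainting module.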
5. Evaluation Protocols and Empirical Performance
Approaches are evaluated using a suite of image and video metrics:
| Metric | Role | Used by |
|---|---|---|
| PSNR / SSIM / MS-SSIM | Frame-level fidelity | StereoWorld, M2SVid, SpatialMe, StereoPilot |
| LPIPS | Perceptual similarity | StereoWorld, Elastic3D, SpatialMe, Restereo |
| EPE / D1-all | Disparity/geometric error | StereoWorld |
| SIOU/CLIP Score | Semantic stereo/view consistency | StereoPilot, SVG/S²VG |
| iSQoE / MEt3R | Human perception, geometric quality | StereoSpace, Eye2Eye |
| FVD, IQ-Score, TF-Score | Temporal/flicker, video quality | StereoWorld, SpatialDreamer |
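Two of the tabulated metrics reduce to one-line formulas; below is a minimal sketch over flat pixel and disparity lists, where the peak value of 1.0 is an assumption:

```python
import math

# Frame-level fidelity (PSNR) and disparity end-point error (EPE)
# in their simplest forms.

def psnr(pred, target, peak=1.0):
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def epe(d_pred, d_gt):
    # mean absolute disparity error, reported in pixels
    return sum(abs(p - g) for p, g in zip(d_pred, d_gt)) / len(d_pred)

print(round(psnr([0.5, 0.5], [0.5, 0.6]), 2))  # 23.01
print(epe([10.0, 12.0], [11.0, 11.0]))         # 1.0
```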
Notable results include:
- SpatialMe surpasses all baselines, e.g., MAE 0.0318, LPIPS 0.0478, SSIM 0.8522, PSNR 31.45 (Zhang et al., 2024).
- StereoWorld outperforms previous methods with PSNR ≈ 26, SSIM 0.796, LPIPS 0.0952, and EPE 17.45 px (Xing et al., 10 Dec 2025).
- Elastic3D achieves state-of-the-art 3D perception with >80% pairwise preference in VR studies (Metzger et al., 16 Dec 2025).
- SVG/S²VG show that the training-free frame-matrix inpainting with depth warping is competitive and robust across baselines, especially on new generative video models (Dai et al., 2024, Dai et al., 11 Aug 2025).
Performance ablations reveal that geometry-aware losses (explicit depth/disparity or cross-view consistency) directly impact stereo parallax fidelity, occlusion edge sharpness, and temporal stability. Temporal tiling, spatial tiling, and attention modules mitigate flicker and preserve high-resolution detail in longer or higher-resolution sequences.
6. Limitations and Future Directions
The principal limitations reported are:
- Extreme disparity and occlusion: Wide-baseline scenes, scenes with heavy occlusions, or discontinuous depth remain hard for all approaches; explicit warping pipelines are brittle on reflective/transparent surfaces, while warping-free models can drift in planar or ambiguous structures (Metzger et al., 16 Dec 2025, Geyer et al., 30 Apr 2025).
- Resolution and scalability: Feed-forward and diffusion models are limited by memory constraints; segmentation into tiles (StereoWorld) or short clips (SVG/S²VG) is often required (Xing et al., 10 Dec 2025, Dai et al., 11 Aug 2025).
- Dependence on depth estimation: Depth-based pipelines are sensitive to quality of monocular depth estimation; errors propagate to warping and inpainting modules (Zhang et al., 2024, Shvetsova et al., 22 May 2025).
- Loss of fine detail or stereo rivalry: VAE decoders can lose high-frequency content or introduce artifacts if decoded naïvely; specialized guided decoders (Elastic3D) or mask-driven refinement is beneficial (Metzger et al., 16 Dec 2025).
- Data requirements: Warping-free and hybrid models need large stereo video datasets or synthetic proxy data; self-supervised, cycle-based, or synthetic data solutions alleviate but do not eliminate this requirement (Lv et al., 2024, Shen et al., 18 Dec 2025).
Ongoing research explores mesh- or radiance field-based representations, spatio-temporal attention modules with larger receptive field, improved depth or multi-view proxy supervision, and bidirectional VR-specific perception metrics for end-to-end optimization.
7. Summary Table of Representative Methods
| Method | Approach Type | Explicit Geometry | Temporal Consistency | Stereo Format Handling |
|---|---|---|---|---|
| FMDN (Wang et al., 2019) | Flow + depth, 2-view fusion | Yes (flow+depth) | Yes, via fusion | Arbitrary camera motion |
| SpatialMe (Zhang et al., 2024) | Depth–warp–inpaint (multi-branch) | Yes | Yes | Rectified/parallel |
| SVG/S²VG (Dai et al., 2024, Dai et al., 11 Aug 2025) | Depth–warp + frame-matrix diffusion | Yes | Yes (joint inpaint) | Arbitrary/canonical |
| Elastic3D (Metzger et al., 16 Dec 2025) | Warping-free, latent diffusion | No, epipolar prior | Yes (latent, guided) | Any (user disparity dial) |
| StereoPilot (Shen et al., 18 Dec 2025) | Feed-forward transformer | No, prior via attn | Yes | Parallel/converged (switch) |
| StereoWorld (Xing et al., 10 Dec 2025) | Diffusion transformer + explicit loss | Soft (disparity loss) | Yes (spatio-temp) | Parallel/all (dataset) |
| Eye2Eye (Geyer et al., 30 Apr 2025) | Warping-free, direct diffusion | No | Yes | Parallel (rectified) |
| M2SVid (Shvetsova et al., 22 May 2025) | Depth-warp + SVD-based inpaint | Yes | Yes (full-attn mask) | Rectified/parallel |
| SpatialDreamer (Lv et al., 2024) | Self-sup. + temporal attn | Yes (DVG) | Yes (TIL) | Arbitrary, synthetic stereo |
| F3D-Gaus (Wang et al., 12 Jan 2025) | 3D Gaussian splatting + cycle | Yes | Yes | Multi-view/stereo |
All cited methods converge on the necessity of geometry-awareness—either via explicit depth/disparity modeling or via multi-view priors—to achieve robust and artifact-free monocular-to-stereo video conversion in real-world, dynamic scenes.