- The paper presents an integrated pipeline combining data generation, backward warping, and sparsity-aware inpainting to achieve real-time stereo video completion.
- It introduces methodological innovations like GAPW for clean occlusion boundaries and PBDP for scalable pseudo-stereo data generation, ensuring geometric consistency.
- Empirical evaluations demonstrate state-of-the-art performance with 30.5 dB PSNR and 24.9 fps on HD videos, alongside robust generalization across multiple datasets.
Real-Time Stereo Inpainting for HD Videos with DreamStereo
Introduction and Motivation
The surge in AR/VR adoption and immersive media consumption has invigorated research in high-resolution stereo video generation and inpainting. The task of stereo video inpainting—filling occluded regions in temporally coherent, geometry-consistent stereo video—remains unresolved, particularly under challenging conditions such as small, sparse occlusions scattered along object boundaries. Prevailing solutions are hindered by a lack of large-scale, high-quality stereo datasets, by inconsistencies and artifacts in mask generation, and by severe computational bottlenecks caused by needlessly processing the majority of unoccluded image regions. DreamStereo directly addresses these bottlenecks with an integrated data generation, warping, and sparse inpainting pipeline, delivering real-time, state-of-the-art results for HD stereo video.
Methodological Contributions
Gradient-Aware Parallax Warping (GAPW)
GAPW fundamentally improves upon the typical forward warping paradigm, which is prone to introducing scattered, misaligned artifacts at depth discontinuities. By formulating the warping operation in a backward fashion and explicitly calculating the local coordinate-mapping gradient (i.e., the Jacobian), GAPW achieves both clean occlusion-boundary delineation and suppression of flying-pixel artifacts in multi-layer backgrounds. The occlusion mask is obtained by thresholding the Jacobian's norm, providing a geometrically meaningful indicator of regions requiring inpainting. This mechanism yields smoother, higher-fidelity occlusion masks than prior works such as TrajectoryCrafter, which rely solely on forward warping.
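The backward-warping-plus-Jacobian idea described above can be illustrated with a minimal sketch. The function name, the nearest-neighbour sampling, and the threshold value are illustrative assumptions, not the paper's actual implementation (which would use bilinear sampling and a learned pipeline):

```python
import numpy as np

def gapw_backward_warp(src, disparity, jac_thresh=1.5):
    """Sketch of gradient-aware parallax warping (hypothetical API).

    src:       (H, W, 3) source view
    disparity: (H, W) disparity defined on the *target* view, so each
               target pixel (x, y) samples src at (x - disparity, y),
               i.e., a backward warp.
    Returns the warped view and an occlusion mask obtained by
    thresholding the norm of the coordinate-mapping gradient (Jacobian).
    """
    H, W = disparity.shape
    ys, xs = np.mgrid[0:H, 0:W]
    map_x = xs - disparity                      # backward coordinate mapping

    # 1D Jacobian of the horizontal mapping: d(map_x)/dx.
    jac = np.gradient(map_x, axis=1)

    # A large |Jacobian| means the mapping stretches across a depth
    # discontinuity: mark those target pixels as occluded (to inpaint).
    occlusion = np.abs(jac) > jac_thresh

    # Nearest-neighbour backward sampling (bilinear in practice).
    sample_x = np.clip(np.round(map_x).astype(int), 0, W - 1)
    warped = src[ys, sample_x]
    warped[occlusion] = 0
    return warped, occlusion
```

Because every target pixel pulls a value from the source, backward warping never leaves the "holes" that forward splatting scatters near boundaries; the Jacobian threshold then isolates exactly the stretched regions that need inpainting.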
Parallax-Based Dual Projection (PBDP)
Owing to the expense and limited diversity of stereo data acquisition, DreamStereo leverages monocular videos to generate large-scale, high-quality pseudo-stereo training data via PBDP. The approach utilizes GAPW for reprojection, enabling consistent estimation of occlusion masks and warped views from monocular input. Importantly, PBDP produces occlusion masks with correct spatial structure, mitigating the inconsistencies often present in masks constructed via naïve displacement or forward warping. This scalable data-construction pipeline allows DreamStereo to achieve effective stereo inpainting pretraining and finetuning without curated binocular video corpora.
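A pseudo-stereo sample of this kind can be sketched as follows. The inverse-depth-to-disparity conversion, the baseline parameter, and the simple gradient-based mask are assumptions for illustration; the paper's pipeline applies GAPW with an estimated monocular depth model rather than this toy shift:

```python
import numpy as np

def pseudo_stereo_pair(frame, depth, baseline_px=24.0, grad_thresh=1.5):
    """Hypothetical PBDP sketch: build a pseudo-stereo training sample
    from a monocular frame and its (relative) depth map.

    Disparity is taken as inversely proportional to depth; the second
    view comes from a backward horizontal shift, and the occlusion mask
    marks pixels whose mapping gradient exceeds grad_thresh.
    """
    H, W = depth.shape
    disparity = baseline_px / np.maximum(depth, 1e-6)  # inverse-depth parallax
    ys, xs = np.mgrid[0:H, 0:W]
    map_x = xs - disparity
    occlusion = np.abs(np.gradient(map_x, axis=1)) > grad_thresh
    sample_x = np.clip(np.round(map_x).astype(int), 0, W - 1)
    right = frame[ys, sample_x]
    right[occlusion] = 0                               # holes to be inpainted
    return right, occlusion
```

The (frame, right, occlusion) triple then serves as a supervised inpainting example: the model is trained to restore the masked regions of the synthesized view.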
Sparsity-Aware Stereo Inpainting (SASI)
Recognizing that typical stereo inpainting masks occupy less than 30% of all video pixels, DreamStereo introduces a task-driven sparsity-aware computational strategy. Mask-guided token selection, informed by dilated occlusion masks, prunes over 70% of the visual tokens from the Diffusion Transformer (DiT) computation. Combined with a 3D-aware, distilled VAE for efficient spatiotemporal encoding/decoding, this pipeline enables a 10.7× acceleration in diffusion inference with less than a 2% absolute change in quality metrics (PSNR, SSIM, LPIPS). The system achieves 24.9 frames per second on a single A100 GPU for 768×1280 HD videos, a substantial numerical and practical advancement over prior diffusion-based stereo inpainting models.
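The mask-guided token selection can be sketched as below. The function name, patch size, and one-step binary dilation are illustrative assumptions; the point is that pooling a dilated occlusion mask onto the DiT patch grid lets fully visible patches skip the transformer entirely:

```python
import numpy as np

def select_tokens(occlusion, patch=16, dilate=1):
    """Sketch of mask-guided token selection (hypothetical names).

    The occlusion mask is dilated, then pooled onto the patch grid;
    only patches overlapping the dilated mask keep their tokens.
    Returns the kept token indices and the achieved pruning ratio.
    """
    H, W = occlusion.shape
    occ = occlusion.copy()
    for _ in range(dilate):                      # cheap 4-neighbour dilation
        padded = np.pad(occ, 1)
        occ = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
               padded[1:-1, :-2] | padded[1:-1, 2:] | occ)
    grid = occ.reshape(H // patch, patch, W // patch, patch).any(axis=(1, 3))
    keep = np.flatnonzero(grid.ravel())          # token indices to process
    prune_ratio = 1.0 - keep.size / grid.size
    return keep, prune_ratio
```

With occlusions covering under 30% of pixels, the pruning ratio routinely exceeds 70%, which is where the reported inference acceleration originates; unpruned tokens are scattered back into place after the transformer pass.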
Empirical Evaluation
Extensive benchmarking is performed on three stereo datasets: HD-100 (real-world, 4K-derived), Dynamic Replica (synthetic, with ground-truth disparities), and SVD (Apple Vision Pro subset, real data with estimated disparities). Quantitative metrics include PSNR, SSIM, LPIPS, and geometry-consistency evaluations via aligned depth errors.
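For reference, PSNR — the headline fidelity metric in these comparisons — is a direct function of mean squared error; a minimal implementation (assuming images normalized to a known peak value):

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

SSIM and LPIPS are more involved (windowed structural statistics and deep-feature distances, respectively) and are typically taken from standard library implementations rather than re-derived.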
DreamStereo decisively outperforms both stereo inpainting (e.g., StereoCrafter, ZeroStereo) and general video inpainting baselines (e.g., ProPainter, VACE-1.3B) in terms of quality and latency. Notably:
- On HD-100 at 768×1280: DreamStereo achieves 30.5 dB PSNR and real-time inference at 24.9 fps, exceeding the accuracy and perceptual quality (SSIM=0.900, LPIPS=0.053) of leading baselines, which are orders of magnitude slower or produce visible artifacts.
- Generalization to Synthetic and Real Stereo: On Dynamic Replica and SVD, DreamStereo delivers the best or highly competitive results in all objective metrics, demonstrating superior cross-domain robustness.
- Ablation studies confirm that GAPW-based mask/data generation is essential for high-fidelity completion, while SASI achieves major acceleration with negligible degradation. The system maintains geometric fidelity even as stereo baseline (maximum disparity) increases.
Failure Modes
While the proposed system greatly advances inpainting quality and efficiency, challenges persist for scenes featuring transparency, specular reflections, and extreme high-frequency textures. These failure cases motivate further research on integrating stronger scene priors and advanced geometric reasoning within view synthesis pipelines.
Implications and Future Directions
DreamStereo constitutes a significant technical advancement for stereo video generation and inpainting, especially in AR/VR and immersive video creation at high resolutions. The introduction of mask-guided sparsity addresses one of the fundamental scalability challenges for generative transformers in vision, highlighting a general paradigm for workload reduction in structured generative tasks. The GAPW/PBDP approach substantially relaxes the data-dependence bottleneck for stereo inpainting by enabling scalable pseudo-stereo supervision.
Future directions include:
- Extending sparsity-aware transformer inference to other structured video and 3D generative tasks.
- Incorporating multi-view constraints or explicit geometric priors for real-world challenging scenarios (e.g., handling reflection/transparency).
- Generalizing the data construction and mask-guided inpainting paradigm to broader view synthesis and multi-modal generative settings.
Conclusion
DreamStereo pioneers a fully-integrated, real-time stereo inpainting approach for HD videos, addressing data, geometric, and computational inefficiencies inherent in prior works. Through GAPW, PBDP, and SASI, the method achieves strong numerical results in both fidelity and speed, realizing practical stereo AR/VR video pipelines on commodity hardware and establishing a new benchmark for stereo inpainting research (2604.12270).