StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models (2403.04965v2)
Abstract: The demand for stereo images is growing as manufacturers launch more XR devices. To meet this demand, we introduce StereoDiffusion, a method that, unlike traditional inpainting pipelines, is training-free, remarkably straightforward to use, and integrates seamlessly into the original Stable Diffusion model. Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs, without fine-tuning model weights or any post-processing of images. Starting from the original input, we generate a left image and estimate a disparity map for it, then obtain the latent vector for the right image through Stereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking Denoise and Self-Attention Layers Modification to align the right image with the left. Moreover, our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations.
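The core idea behind the Stereo Pixel Shift described above can be illustrated, in a much-simplified form, as forward-warping the left-view latent by a per-pixel disparity map and recording the disoccluded "holes" that a masked-denoise step would then fill. This is only a minimal sketch: the function name `stereo_pixel_shift`, the toy tensors, the integer disparity rounding, and the hole-mask convention are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def stereo_pixel_shift(latent, disparity):
    """Forward-warp a latent tensor (C, H, W) by a disparity map (H, W).

    Returns the right-view latent and a boolean mask of holes
    (disoccluded positions) left unfilled by the warp.
    Illustrative sketch only: integer shifts, no depth-ordered overwrite.
    """
    c, h, w = latent.shape
    shifted = np.zeros_like(latent)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            # Shift each pixel left by its disparity to form the right-eye view.
            x_new = x - int(disparity[y, x])
            if 0 <= x_new < w:
                shifted[:, y, x_new] = latent[:, y, x]
                filled[y, x_new] = True
    hole_mask = ~filled  # positions a masked denoise would need to inpaint
    return shifted, hole_mask

# Toy example: a uniform disparity of 1 shifts every column left by one,
# leaving the rightmost column as a hole.
lat = np.arange(2 * 2 * 4, dtype=float).reshape(2, 2, 4)
disp = np.ones((2, 4))
out, holes = stereo_pixel_shift(lat, disp)
```

In practice the shift would operate on the diffusion latent at its reduced spatial resolution, with the disparity map downsampled accordingly, and the hole regions would be regenerated by the denoising process rather than left at zero.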