
SinMPI: Novel View Synthesis from a Single Image with Expanded Multiplane Images (2312.11037v1)

Published 18 Dec 2023 in cs.CV

Abstract: Single-image novel view synthesis is a challenging and ongoing problem that aims to generate an infinite number of consistent views from a single input image. Although significant efforts have been made to advance the quality of generated novel views, less attention has been paid to the expansion of the underlying scene representation, which is crucial to the generation of realistic novel view images. This paper proposes SinMPI, a novel method that uses an expanded multiplane image (MPI) as the 3D scene representation to significantly expand the perspective range of MPI and generate high-quality novel views from a large multiplane space. The key idea of our method is to use Stable Diffusion to generate out-of-view contents, project all scene contents into an expanded multiplane image according to depths predicted by monocular depth estimators, and then optimize the multiplane image under the supervision of pseudo multi-view data generated by a depth-aware warping and inpainting module. Both qualitative and quantitative experiments have been conducted to validate the superiority of our method to the state of the art. Our code and data are available at https://github.com/TrickyGo/SinMPI.


Summary

  • The paper introduces SinMPI, a method that expands multiplane representations to synthesize unlimited 3D-consistent novel views from a single image.
  • It combines Stable Diffusion outpainting with monocular depth estimation to extend scene content beyond the input view, avoiding the depth-discretization and repeated-texture artifacts of earlier MPI methods.
  • Experimental results demonstrate superior realism and flexible viewpoint generation compared to state-of-the-art MPI-based methods.

Introduction to SinMPI

The paper addresses novel view synthesis from a single image, i.e., generating arbitrarily many consistent views of a scene from one input image. The proposed method, SinMPI (Single Image with Expanded Multiplane Images), significantly expands the underlying scene representation, allowing a much broader range of camera perspectives.

Scene Representation with SinMPI

At the core of SinMPI is an expanded multiplane image (MPI) representation that extends the scene well beyond the input camera's frustum, which is crucial for creating realistic and consistent novel views. Previous MPI-based methods were confined to the original frustum and suffered from depth discretization and repeated-texture artifacts. SinMPI instead represents the expanded 3D scene as learnable per-plane parameters optimized through volume rendering.
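
As context for how any MPI produces a view: each of the D fronto-parallel planes carries RGB and opacity, and the planes are composited front-to-back with the standard over operator. The sketch below shows this compositing rule in PyTorch; the tensor layout and plane ordering are illustrative assumptions, not SinMPI's exact implementation.

```python
import torch

def composite_mpi(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of an MPI.

    colors: (D, 3, H, W) per-plane RGB, ordered nearest plane first.
    alphas: (D, 1, H, W) per-plane opacity in [0, 1].
    Returns a (3, H, W) composited image.
    """
    # Transmittance reaching each plane: product of (1 - alpha) over all nearer planes.
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas[:-1]], dim=0), dim=0
    )
    weights = alphas * transmittance        # per-plane contribution
    return (weights * colors).sum(dim=0)    # sum over the depth dimension
```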

Technique and Pipeline

The methodology used in SinMPI can be described in the following stages:

  1. Outpainting: An image generator based on Stable Diffusion is employed to create out-of-view contents, extending the visual scene information beyond the single available view.
  2. Depth Prediction: The extended scene content and the original input are assigned depth values using monocular depth estimators.
  3. Pseudo Multi-view Generation: A depth-aware warping and inpainting module warps the scene to novel viewpoints and fills the resulting disocclusions, producing pseudo multi-view training data (see the warping sketch after this list).
  4. Optimization: All scene content is projected into the expanded MPI according to the predicted depths, and the MPI's learnable parameters are refined via volume rendering under the supervision of the pseudo multi-view data, improving the rendering of novel viewpoints in complex scenes (a minimal optimization loop is sketched after this list).
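
To make stage 3 concrete, the sketch below forward-warps a source image into a novel camera using its depth map; pixels left black are disocclusions that an inpainting step, like the paper's module, would fill. This is a minimal NumPy illustration of depth-aware warping under assumed conventions (pinhole camera, depth along the optical axis), not the authors' implementation, and it omits z-buffering for pixels that land on the same target location.

```python
import numpy as np

def warp_to_novel_view(image, depth, K, R, t):
    """Depth-aware forward warp of a source view into a novel camera.

    image: (H, W, 3) source view; depth: (H, W) per-pixel depth;
    K: (3, 3) intrinsics; R, t: novel-camera pose relative to the source.
    Returns the warped image, with disocclusions left as zeros (holes).
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project to 3D
    proj = K @ (R @ pts + t.reshape(3, 1))                # transform + project
    uv = np.round(proj[:2] / np.clip(proj[2:], 1e-6, None)).astype(int)
    valid = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H) & (proj[2] > 0)
    out = np.zeros_like(image)
    out[uv[1, valid], uv[0, valid]] = image.reshape(-1, 3)[valid]
    return out
```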

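And for stage 4, a minimal optimization loop: the MPI's color and opacity tensors are treated as learnable parameters and fitted to the pseudo views by gradient descent on a reconstruction loss. Here `render_fn` is a caller-supplied renderer (e.g., warping the planes to the target pose and compositing them as in `composite_mpi` above); the signatures and plain MSE loss are assumptions for illustration, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def fit_mpi(mpi_params, pseudo_views, poses, render_fn, steps=2000, lr=1e-2):
    """Optimize learnable MPI tensors against pseudo multi-view targets.

    mpi_params: dict of tensors (e.g. {'rgb': ..., 'alpha': ...}) created
        with requires_grad=True.
    render_fn(mpi_params, pose): renders the MPI from a given camera pose.
    """
    opt = torch.optim.Adam(mpi_params.values(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Sum the reconstruction loss over all pseudo multi-view targets.
        loss = sum(F.mse_loss(render_fn(mpi_params, pose), target)
                   for target, pose in zip(pseudo_views, poses))
        loss.backward()
        opt.step()
    return mpi_params
```
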
Experimental Results

The paper reports both qualitative and quantitative experiments on multiple datasets, demonstrating the method's ability to generate 3D-consistent novel views. The approach yields substantial improvements over existing state-of-the-art methods in realism and in the range of viewpoints it can cover.
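
The summary does not spell out the metrics, but quantitative comparisons in this area are typically scored with PSNR, SSIM, and LPIPS against held-out views. As a reference point, a minimal PSNR implementation:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio (dB) between images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```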

Conclusion and Future Work

SinMPI marks a significant step forward for single-image novel view synthesis, allowing a much wider range of viewpoints with fast, 3D-consistent rendering. While it pushes the boundaries of scene expansion and render quality, the authors acknowledge limitations, including reliance on the accuracy of monocular depth estimates and difficulty reproducing lighting effects such as specular reflections. They suggest future work on addressing these limitations and incorporating additional realistic view-dependent effects.
