
GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

Published 27 May 2024 in cs.CV (arXiv:2405.17251v2)

Abstract: Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models. However, they struggle with noisy depth maps and loss of semantic details when warping an input view to novel viewpoints. In this paper, we propose a novel approach for single-shot novel view synthesis, a semantic-preserving generative warping framework that enables T2I generative models to learn where to warp and where to generate, through augmenting cross-view attention with self-attention. Our approach addresses the limitations of existing methods by conditioning the generative model on source view images and incorporating geometric warping signals. Qualitative and quantitative evaluations demonstrate that our model outperforms existing methods in both in-domain and out-of-domain scenarios. Project page is available at https://GenWarp-NVS.github.io/.


Summary

  • The paper introduces GenWarp, a framework that integrates warping signals with generative diffusion to preserve semantic details during novel view synthesis.
  • It employs a two-stream U-Net architecture that combines a semantic preserver with a diffusion model, using warped coordinate embeddings as geometric priors.
  • Experimental results on datasets like RealEstate10K and ScanNet show that GenWarp outperforms baseline methods, improving metrics such as FID and PSNR.

Overview of Semantic-Preserving Generative Warping for Single-Shot Novel View Synthesis

Generating novel views from a single image is a complex task that has seen significant advances in recent years. The paper "GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping" introduces a framework, named GenWarp, designed to address the challenges faced by previous warping-and-inpainting approaches. GenWarp integrates geometric warping and generative modeling through augmented attention mechanisms to produce high-quality novel views with preserved semantic details.

Introduction

The generation of novel views from a single image is highly relevant for applications such as portrait design, cartoon creation, and movie production. Traditional Text-to-Image (T2I) models like Stable Diffusion exhibit limitations in multi-view generation due to their lack of inherent 3D scene awareness. Recent methods combining T2I models with Monocular Depth Estimation (MDE) offer a promising yet imperfect solution. These methods, which rely on warping input images using depth maps followed by inpainting to fill occluded regions, often struggle with noisy depth predictions and the loss of semantic coherence.

Methodology

GenWarp takes a more integrated approach, injecting warping signals into the generative process itself rather than treating novel-view completion as a separate inpainting step. The core innovation lies in augmenting the self-attention mechanism with cross-view attention, effectively allowing the model to learn where to warp and where to generate content. This integration occurs directly within the attention layers of a diffusion model fine-tuned for the task.
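
To make the geometric side of this concrete, the sketch below shows one common way to compute such a warping signal: using an estimated depth map and a relative camera pose to find where each source pixel lands in the target view. It is a minimal illustration under a standard pinhole-camera assumption; the function name, tensor layout, and splatting-free formulation are ours, not the paper's implementation.

```python
# Hedged sketch of a depth-based warping signal: for every source pixel, compute
# its landing position in the target view. Assumes a pinhole camera model.
import torch

def warp_source_pixels(depth, K, R, t):
    """depth: (H, W) source-view depth; K: (3, 3) intrinsics;
    R, t: rotation (3, 3) and translation (3,) from source to target camera.
    Returns (H, W, 2) target-view pixel coordinates for every source pixel."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # (H*W, 3)

    # Unproject source pixels to 3D points in the source camera frame.
    rays = pix @ torch.linalg.inv(K).T                                    # (H*W, 3)
    points_src = rays * depth.reshape(-1, 1)                              # scale by depth

    # Move the points into the target camera frame and re-project.
    points_tgt = points_src @ R.T + t
    proj = points_tgt @ K.T
    uv_tgt = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                   # perspective divide
    return uv_tgt.reshape(H, W, 2)

# Example usage with dummy data:
# depth = torch.full((64, 64), 2.0)
# K = torch.tensor([[64., 0., 32.], [0., 64., 32.], [0., 0., 1.]])
# coords = warp_source_pixels(depth, K, torch.eye(3), torch.tensor([0.1, 0.0, 0.0]))
```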

Two-Stream Architecture

The proposed architecture comprises a semantic preserver network and a diffusion model, both built on a U-Net backbone. The semantic preserver encodes the input view into a feature map, while the diffusion model generates the novel view by fusing features from the input view with those of the view being generated. The key novelty is the use of warped coordinate embeddings, which act as geometric priors derived from the input image's depth map and the desired camera viewpoint. This conditioning facilitates a robust generative process that respects the geometric transformation between views.
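
As a rough illustration of what such a geometric prior can look like, the snippet below Fourier-encodes a 2D coordinate map (for example, the warped coordinates from the previous sketch) into a multi-frequency embedding, in the spirit of Fourier feature encodings. The normalization, frequency schedule, and the way the embedding would be injected into the U-Net are assumptions for illustration only.

```python
# Minimal sketch of a Fourier-feature embedding of a (possibly warped) coordinate map,
# used here as a stand-in for the paper's warped coordinate embedding.
import math
import torch

def fourier_coordinate_embedding(coords, num_freqs=8):
    """coords: (H, W, 2) pixel coordinates (e.g. output of warp_source_pixels).
    Returns (H, W, 4 * num_freqs) sinusoidal embeddings."""
    H, W, _ = coords.shape
    # Normalize to roughly [-1, 1] so every frequency sees a comparable input range.
    scale = torch.tensor([W - 1, H - 1], dtype=coords.dtype)
    norm = coords / scale * 2.0 - 1.0
    freqs = (2.0 ** torch.arange(num_freqs, dtype=coords.dtype)) * math.pi  # (F,)
    angles = norm.unsqueeze(-1) * freqs                                     # (H, W, 2, F)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)         # (H, W, 2, 2F)
    return emb.reshape(H, W, -1)                                            # (H, W, 4F)

# The target-view embedding would condition the diffusion stream, while the un-warped
# source coordinates, embedded the same way, would condition the semantic preserver.
```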

Augmented Self-Attention

The model extends the self-attention mechanism by incorporating cross-view attention, enabling the fusion of input view features with the target view features. This hybrid attention mechanism allows the model to balance between generating new content and warping existing features accurately. By concatenating the self-attention map and the cross-view attention map, the model aligns semantic details from the input view with the generated novel view, preserving consistency and reducing artifacts.
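
The following single-head sketch illustrates the idea: queries from the view being generated attend over a concatenation of its own tokens and the source-view tokens, with a single softmax over the joint scores so the model can trade off copying (warping) source content against generating new content. Layer names, shapes, and the single-head simplification are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged single-head sketch of self-attention augmented with cross-view attention.
import torch
import torch.nn as nn

class AugmentedCrossViewAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, tgt_tokens, src_tokens):
        """tgt_tokens: (B, N, C) features of the novel view being denoised.
        src_tokens: (B, M, C) features from the semantic-preserver stream."""
        q = self.to_q(tgt_tokens)
        k = self.to_k(torch.cat([tgt_tokens, src_tokens], dim=1))   # self + cross keys
        v = self.to_v(torch.cat([tgt_tokens, src_tokens], dim=1))
        attn = (q @ k.transpose(-2, -1)) * self.scale                # (B, N, N + M)
        attn = attn.softmax(dim=-1)  # one softmax over joint scores: warp vs. generate
        return attn @ v                                              # (B, N, C)

# usage with dummy tensors:
# layer = AugmentedCrossViewAttention(dim=64)
# out = layer(torch.randn(1, 16, 64), torch.randn(1, 16, 64))   # (1, 16, 64)
```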

Performance Evaluation

The efficacy of GenWarp is validated through extensive experiments on datasets such as RealEstate10K, ScanNet, and in-the-wild images. Qualitative results demonstrate GenWarp's superior capability in generating coherent and contextually consistent novel views, even in challenging scenarios involving large viewpoint changes. Quantitative metrics, including FID and PSNR, further corroborate these findings, showing that GenWarp outperforms baseline methods like GeoGPT and traditional warping-and-inpainting approaches.
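
For context, PSNR is a simple pixel-space fidelity measure; a minimal computation is sketched below (FID, by contrast, requires a pretrained Inception network and is typically taken from an existing library). The value range and the numerical epsilon are illustrative assumptions.

```python
# Minimal PSNR sketch for comparing a generated view against ground truth.
import torch

def psnr(pred, target, max_val=1.0):
    """pred, target: tensors of identical shape with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse.clamp(min=1e-12))

# e.g. psnr(generated_view, ground_truth_view) on aligned (C, H, W) images.
```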

Implications and Future Work

GenWarp addresses a crucial limitation in novel view synthesis by effectively combining depth-based warping with advanced generative modeling. This method not only enhances the quality of generated views but also ensures semantic coherence, making it highly applicable to various real-world scenarios. Future research could explore optimizing the attention mechanisms further, integrating additional sensor inputs, or expanding to more complex scene understanding tasks.

In conclusion, GenWarp represents a significant step forward in single-shot novel view synthesis, leveraging sophisticated attention mechanisms to achieve higher fidelity and more semantically accurate generative results. This work lays the groundwork for future innovations in combining geometric transformations with generative modeling, potentially leading to more advanced applications in AI-driven content creation.
