
Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention (2405.11616v3)

Published 19 May 2024 in cs.CV

Abstract: In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficacy, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images should comply with a predefined camera type, e.g. a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails. Moreover, the full-image or dense multiview attention they employ leads to an exponential explosion of computational complexity as image resolution increases, resulting in prohibitively expensive training costs. To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module to estimate the focal length and elevation of the input image, which allows our method to generate images without shape distortions. Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion. Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images with up to a 512*512 resolution while reducing computation complexity by 12x times. Comprehensive experiments demonstrate that Era3D can reconstruct high-quality and detailed 3D meshes from diverse single-view input images, significantly outperforming baseline multiview diffusion methods. Project page: https://penghtyx.github.io/Era3D/.

Authors (13)
  1. Peng Li (390 papers)
  2. Yuan Liu (342 papers)
  3. Xiaoxiao Long (47 papers)
  4. Feihu Zhang (15 papers)
  5. Cheng Lin (43 papers)
  6. Mengfei Li (10 papers)
  7. Xingqun Qi (21 papers)
  8. Shanghang Zhang (173 papers)
  9. Wenhan Luo (88 papers)
  10. Ping Tan (101 papers)
  11. Wenping Wang (184 papers)
  12. Qifeng Liu (28 papers)
  13. Yike Guo (144 papers)

Summary

Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

In "Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention," the authors introduce Era3D, a multiview diffusion method that generates high-resolution multiview images from a single-view image for 3D reconstruction. Era3D mitigates three issues common to prior multiview generation techniques: camera prior mismatch, inefficiency, and low resolution.

Era3D's primary contribution is an architectural design that addresses three key challenges: mismatched camera assumptions, inefficient multiview diffusion, and low output resolution. Prior methods such as Wonder3D and SyncDreamer assume that input images comply with a fixed camera type, which leads to distorted shapes when that assumption fails, and the dense multiview attention they employ incurs high computational cost as resolution increases.

Key Contributions

  1. Diffusion-based Camera Prediction Module: Era3D introduces a novel diffusion-based camera prediction module that estimates the focal length and elevation of the input image. This allows the model to generate multiview images without the shape distortions observed in previous models.
  2. Row-wise Attention for Epipolar Priors: The authors develop a new attention layer, row-wise attention, that enforces epipolar priors across multiview images. Compared with dense multiview attention, it reduces computational complexity by roughly 12x, making Era3D notably more efficient.
  3. High-Resolution Image Generation: Era3D is capable of generating multiview images at a resolution of up to 512x512 pixels. This is a substantial improvement over existing methods limited to 256x256 pixels, permitting Era3D to reconstruct highly detailed 3D meshes.
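The efficiency gain can be illustrated with a back-of-the-envelope token count (a sketch, not the paper's code; function names and latent sizes are illustrative assumptions). Dense multiview attention lets every token attend to every token across all views, while row-wise attention restricts attention to matching image rows:

```python
def dense_attention_pairs(n_views: int, h: int, w: int) -> int:
    """Dense multiview attention: every token attends to every token
    across all views, so cost scales with (n_views * h * w) ** 2."""
    tokens = n_views * h * w
    return tokens * tokens


def rowwise_attention_pairs(n_views: int, h: int, w: int) -> int:
    """Row-wise attention: tokens attend only within the same image row
    across views, so cost scales with h * (n_views * w) ** 2."""
    row_tokens = n_views * w
    return h * row_tokens * row_tokens


if __name__ == "__main__":
    n, h, w = 6, 64, 64  # e.g. 6 views, 64x64 latent for 512x512 pixels
    ratio = dense_attention_pairs(n, h, w) / rowwise_attention_pairs(n, h, w)
    # The ratio of attention pairs equals h; the paper's reported 12x
    # reduction is an end-to-end figure that includes other costs.
    print(f"dense / row-wise = {ratio:.0f}x")
```

The ratio of attention pairs works out to exactly `h`, which is why the savings grow with resolution.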

Experimental Validation

Comprehensive experiments validate the efficacy of Era3D. Notably, the model outperforms state-of-the-art methods in generating high-quality and detailed 3D meshes from diverse single-view input images. The performance metrics, including Chamfer Distance (CD) and Intersection over Union (IoU), demonstrate significant improvements over the baseline models.

The experiments are conducted on the Objaverse dataset, comprising images with varying focal lengths and viewpoints. The authors highlight the importance of addressing perspective distortions and show that Era3D's approach of using different camera models for input and generated images effectively mitigates these issues.
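The Chamfer Distance metric used in these comparisons can be sketched in a few lines of NumPy (a minimal illustration of the standard symmetric formulation, not the paper's evaluation code):

```python
import numpy as np


def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets p (N,3) and q (M,3):
    mean nearest-neighbour distance from p to q plus from q to p."""
    # Pairwise squared distances, shape (N, M).
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())


# Identical point sets have zero Chamfer Distance.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(chamfer_distance(pts, pts))  # 0.0
```

Lower values indicate that the reconstructed mesh surface lies closer to the ground-truth geometry.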

Technical Insights

  1. Canonical Camera Setting: The approach generates multiview images in a canonical camera setting, using orthographic cameras at fixed viewpoints for the outputs regardless of the input camera type. This design alleviates perspective distortions and ensures consistent multiview image generation.
  2. Efficient Row-wise Multiview Attention: The row-wise attention layer capitalizes on the alignment of epipolar lines with image rows, reducing the need to sample multiple points along epipolar lines. This leads to a significant reduction in memory and computational overhead, with memory consumption and execution times reduced by an order of magnitude compared to dense multiview attention mechanisms.
  3. Regression and Condition Scheme: This scheme leverages UNet feature maps to predict camera parameters, enhancing the accuracy of camera pose predictions. These parameters are then utilized as conditions in the diffusion process, enabling the model to output undistorted images in the canonical setting.
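As a rough illustration of the row-wise attention idea (a sketch under assumed shapes and names, not the paper's implementation), the key step is reshaping the multiview feature maps so that each image row across all views forms one independent attention problem:

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def rowwise_multiview_attention(feats: np.ndarray) -> np.ndarray:
    """feats: (n_views, H, W, C). In the canonical orthographic setting,
    epipolar lines align with image rows, so tokens attend to each other
    across views only within the same row, never across rows."""
    n, h, w, c = feats.shape
    # Group by row: (H, n_views*W, C) -- one attention problem per row.
    rows = feats.transpose(1, 0, 2, 3).reshape(h, n * w, c)
    attn = softmax(rows @ rows.transpose(0, 2, 1) / np.sqrt(c), axis=-1)
    out = attn @ rows
    # Restore the original multiview layout.
    return out.reshape(h, n, w, c).transpose(1, 0, 2, 3)


x = np.random.randn(6, 8, 8, 16)  # 6 views, toy 8x8 feature maps
y = rowwise_multiview_attention(x)
print(y.shape)  # (6, 8, 8, 16)
```

Because each row is handled independently, the attention matrices are (n_views*W)² per row rather than one (n_views*H*W)² matrix for the whole image set, which is the source of the memory savings described above.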

Practical and Theoretical Implications

The practical implications of Era3D are profound. The ability to generate high-resolution, detailed multiview images from a single view has significant applications in areas such as virtual reality, game design, and robotics. The reduction in computational demands through efficient attention mechanisms makes this approach scalable and accessible for real-world applications.

Theoretically, the integration of diffusion-based methods for camera prediction and the introduction of row-wise attention open new avenues for efficient multiview image synthesis. Future research can explore further optimizations in attention mechanisms and the application of Era3D's principles to other domains within AI and computer vision.

Conclusion and Future Directions

Era3D represents a substantial step forward in the field of multiview image generation and 3D reconstruction from single-view images. Its novel approach to handling camera priors and efficient attention mechanisms sets a new benchmark for resolution and efficiency in this domain.

Future developments may include refining the camera prediction models, exploring even higher resolutions, and extending the application of Era3D's techniques to other complex data synthesis tasks. Additionally, the integration of Era3D with other large neural reconstruction models could further enhance its applicability and performance in diverse use cases.

In conclusion, this paper introduces several innovative approaches addressing the limitations of current multiview diffusion models, making significant contributions to the fields of computer vision and 3D reconstruction. Era3D demonstrates how thoughtful architectural design can substantially enhance both the quality and efficiency of generated multiview images.
