
One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization (2306.16928v1)

Published 29 Jun 2023 in cs.CV, cs.AI, and cs.RO

Abstract: Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization time, 3D inconsistency results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images for the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method favors better geometry, generates more 3D consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.

Authors (7)
  1. Minghua Liu (22 papers)
  2. Chao Xu (283 papers)
  3. Haian Jin (9 papers)
  4. Linghao Chen (14 papers)
  5. Mukund Varma T (10 papers)
  6. Zexiang Xu (56 papers)
  7. Hao Su (218 papers)
Citations (342)

Summary

Single Image 3D Reconstruction with Zero123 and SDF-Based Techniques

This paper presents One-2-3-45, a method that reconstructs a full 360-degree 3D textured mesh from a single image in roughly 45 seconds, without per-shape optimization. It integrates a view-conditioned 2D diffusion model, Zero123, with a custom generalizable 3D reconstruction module. The task is nontrivial because the 3D structure of surfaces not visible in the input image must be inferred. By producing geometry that remains consistent across views, the work contributes to computer vision, 3D modeling, and immersive technologies such as AR/VR.

Methodology

The approach comprises three core components: multi-view synthesis, camera pose (elevation) estimation, and 3D reconstruction. First, Zero123, a view-conditioned diffusion model fine-tuned from Stable Diffusion, generates multi-view images of the object from the single input image, conditioned on relative camera transformations. This step exploits the rich priors that 2D diffusion models acquire from training on large-scale internet data.
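As an illustrative sketch (not the actual Zero123 interface; `model`, `camera_deltas`, and the pose parameters are hypothetical placeholders), the multi-view synthesis step amounts to one feed-forward diffusion call per relative pose:

```python
# Hypothetical sketch of the multi-view synthesis step. A view-conditioned
# diffusion model (Zero123-style) is queried once per target pose, each pose
# expressed RELATIVE to the input view as (delta elevation, delta azimuth,
# delta radius). `model` is a placeholder callable, not the real Zero123 API.

def camera_deltas(n_views=8, delta_elev_deg=30.0):
    """Relative poses for a ring of evenly spaced views around the object."""
    return [
        {
            "d_elev": delta_elev_deg,        # degrees relative to input elevation
            "d_azim": 360.0 * i / n_views,   # evenly spaced azimuths
            "d_radius": 0.0,                 # keep camera distance unchanged
        }
        for i in range(n_views)
    ]

def generate_multiview(input_image, model, n_views=8):
    """One feed-forward diffusion call per relative pose (no optimization)."""
    return [model(input_image, d) for d in camera_deltas(n_views)]
```

Because every pose is relative to the (unknown) input camera, only the input view's elevation has to be recovered separately, which is what the pose-estimation component below handles.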

A key innovation lies in the 3D reconstruction component, which avoids per-shape optimization and its heavy computational cost. Instead, the authors build on SparseNeuS, a generalizable SDF-based neural surface reconstruction method that extends multi-view stereo (MVS)-style cost-volume techniques to implicit surfaces. With this SDF-based representation and rendering scheme, the paper reports significant improvements in mesh quality and geometric consistency.
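The SDF-based rendering underlying this family of methods (NeuS, which SparseNeuS builds on) converts signed distances along a ray into per-segment opacities. A minimal single-ray sketch, with an illustrative sharpness parameter `s`:

```python
import math

# NeuS-style conversion from signed distances to per-segment opacities, the
# rendering scheme that SDF-based reconstruction modules such as SparseNeuS
# build on. Single-ray toy version; the sharpness `s` is illustrative.

def phi(x, s=10.0):
    """Logistic CDF: maps a signed distance to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s * x))

def alphas_from_sdf(sdf_samples, s=10.0):
    """Opacity of each ray segment; nonzero only where the SDF decreases,
    i.e. where the ray is passing into the surface."""
    out = []
    for a, b in zip(sdf_samples[:-1], sdf_samples[1:]):
        out.append(max((phi(a, s) - phi(b, s)) / max(phi(a, s), 1e-8), 0.0))
    return out
```

A ray that never crosses the zero level set (monotonically increasing SDF) receives zero opacity everywhere, while a ray entering the surface concentrates opacity near the crossing, which is what makes the extracted zero level set a clean mesh surface.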

To address the limitations of existing methods in geometric fidelity and runtime, the authors train the reconstruction network with strategies that make it robust to imperfect, mutually inconsistent multi-view predictions. The resulting system replaces costly iterative optimization with a single feed-forward inference, enabling rapid 3D mesh reconstruction that adheres more faithfully to the input image.

The paper also addresses camera pose estimation for the input image, an often-overlooked requirement for satisfactory 3D lifting. By synthesizing nearby views and scoring candidate elevations for cross-view consistency, the approach obtains an effective estimate of the camera pose in the canonical spherical coordinate frame used by Zero123.
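The elevation search can be sketched as a simple argmin over candidate hypotheses; here `consistency_error` stands in for the actual reprojection/feature-matching score, and the candidate grid is illustrative:

```python
# Hypothetical sketch of the elevation-search heuristic. Nearby views are
# synthesized once; each candidate elevation fully determines the camera poses
# in Zero123's canonical spherical frame, so candidates can be scored by how
# consistently those poses explain the views. `consistency_error` is a
# placeholder for the actual reprojection/feature-matching error.

def estimate_elevation(nearby_views, consistency_error,
                       candidates=range(-10, 81, 10)):
    """Return the candidate elevation (degrees) with the lowest error."""
    best_elev, best_err = None, float("inf")
    for elev in candidates:
        err = consistency_error(nearby_views, elev)
        if err < best_err:
            best_elev, best_err = elev, err
    return best_elev
```

The key design point is that the search reuses the already-synthesized nearby views, so pose estimation adds no extra diffusion calls.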

Results

The experimental evaluation includes both qualitative and quantitative assessments, demonstrating notable improvements over competing methods such as Point-E and Shap-E. In terms of F-Score, a measure of geometric accuracy, One-2-3-45 performs strongly across datasets such as Objaverse and Google Scanned Objects, underscoring its ability to generalize across diverse object classes. Moreover, the results adhere closely to the input image, reflecting the strength of the data-driven visual priors inherited from the diffusion model.
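For concreteness, the F-Score metric is the harmonic mean of precision and recall between reconstructed and ground-truth point clouds at a distance threshold. A brute-force sketch for small point sets (the threshold value is illustrative):

```python
# Brute-force sketch of the F-Score metric used to compare reconstructed and
# ground-truth geometry: the harmonic mean of precision (predicted points that
# lie within `tau` of the ground truth) and recall (ground-truth points covered
# by the prediction). The threshold is illustrative; real evaluations sample
# dense point clouds from the meshes and use a KD-tree for nearest neighbours.

def f_score(pred_pts, gt_pts, tau=0.02):
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def frac_within(src, dst):
        return sum(min(dist(p, q) for q in dst) <= tau for p in src) / len(src)

    precision = frac_within(pred_pts, gt_pts)
    recall = frac_within(gt_pts, pred_pts)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```

Taking the harmonic mean penalizes both spurious geometry (low precision) and missing geometry (low recall), which is why F-Score is preferred over one-sided distances such as Chamfer for comparing reconstructions.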

Implications and Future Work

The proposed technique provides a more efficient pipeline for 3D reconstruction, suggesting potential applications in real-time 3D content creation and robotics where rapid environmental mapping is essential. Practically, the method can serve as a foundational tool for applications requiring real-time scene understanding and manipulation, such as in autonomous driving or augmented reality systems.

On a theoretical level, the marriage between 2D diffusion models and 3D reconstruction offers insight into how learned priors from large-scale 2D datasets can be effectively transferred to 3D tasks. This work could catalyze further research into the development of hybrid models that incorporate multi-modal data to bolster both the accuracy and applicability of computer-generated models in complex real-world scenarios.

Further enhancements could focus on tackling the few remaining challenges, such as consistent texture application and handling occlusions in more complex scenes. Additionally, exploring automated methods for scene understanding could expand this framework's utility in dynamic and non-static environments.

Overall, the blend of innovative model design and efficient processing presented in One-2-3-45 marks a significant step forward in addressing the 3D reconstruction challenges from a single image input.
