One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization (2306.16928v1)
Abstract: Single-image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models, but they suffer from lengthy optimization times, 3D-inconsistent results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images from the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method produces better geometry, generates more 3D-consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.
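The two ingredients named in the abstract, views spread over a full 360 degrees and a signed distance function (SDF) as the surface representation, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the view count, the elevation, and the camera distance are hypothetical, and an analytic sphere stands in for the learned SDF that the reconstruction module would predict.

```python
import math


def ring_of_cameras(n_views, elevation_deg=30.0, distance=2.0):
    """Place n_views camera centers evenly in azimuth around the origin.

    Hypothetical helper: the elevation and distance are illustrative
    defaults, not values taken from the paper.
    """
    elev = math.radians(elevation_deg)
    cameras = []
    for i in range(n_views):
        azim = 2.0 * math.pi * i / n_views
        cameras.append((
            distance * math.cos(elev) * math.cos(azim),
            distance * math.cos(elev) * math.sin(azim),
            distance * math.sin(elev),
        ))
    return cameras


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the
    surface, positive outside. A stand-in for a learned SDF."""
    return math.dist(point, center) - radius


if __name__ == "__main__":
    # Every camera sits outside the unit sphere, so its SDF value is positive.
    for cam in ring_of_cameras(n_views=8):
        print(tuple(round(c, 3) for c in cam), round(sphere_sdf(cam), 3))
```

In an SDF representation the surface is the zero level set of the function, which is why a mesh can be extracted from it afterwards, for example with Marching Cubes.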
- Minghua Liu
- Chao Xu
- Haian Jin
- Linghao Chen
- Mukund Varma T
- Zexiang Xu
- Hao Su