SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion (2311.15855v2)
Abstract: A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single-view images. The main challenge lies in inferring unknown body shapes, appearances, and clothing details in areas not visible in the images. To address this, we propose SiTH, a novel pipeline that integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. At the core of our method lies the decomposition of the challenging single-view reconstruction problem into generative hallucination and reconstruction subproblems. For the former, we employ a powerful generative diffusion model to hallucinate the unseen back-view appearance based on the input images. For the latter, we leverage skinned body meshes as guidance to recover full-body textured meshes from the input and back-view images. SiTH requires as few as 500 3D human scans for training while maintaining its generality and robustness to diverse images. Extensive evaluations on two 3D human benchmarks, including our newly created one, highlight our method's superior accuracy and perceptual quality in 3D textured human reconstruction. Our code and evaluation benchmark are available at https://ait.ethz.ch/sith