DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans (2404.00485v1)
Abstract: We present DiffHuman, a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem, most methods are deterministic and output a single solution, often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions. In contrast, DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image, which allows us to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. During inference, we may sample 3D avatars by iteratively denoising 2D renders of the predicted 3D representation. Furthermore, we introduce a generator neural network that approximates rendering with considerably reduced runtime (55x speed up), resulting in a novel dual-branch diffusion framework. Our experiments show that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image, while remaining competitive with the state-of-the-art when reconstructing visible surfaces.
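The sampling procedure described in the abstract (iteratively denoising 2D renders of a predicted 3D representation) can be illustrated with a toy sketch. This is not the authors' implementation. The functions `predict_3d` and `render`, the linear noise schedule, and the deterministic DDIM-style update are all assumptions made for illustration; the abstract does not specify the network, the renderer, or the sampler.

```python
import numpy as np

def predict_3d(x_t, image, t):
    """Hypothetical stand-in for the conditional denoiser.
    A real model would output a 3D shape representation; this toy
    version just blends the noisy observation with the input image."""
    return 0.5 * (x_t + image)

def render(rep):
    """Hypothetical stand-in for rendering the 3D representation to a
    pixel-aligned 2D observation (or for a generator approximating it)."""
    return np.clip(rep, -1.0, 1.0)

def sample(image, num_steps=10, seed=0):
    """Draw one sample by iterative denoising (deterministic DDIM-style update)."""
    rng = np.random.default_rng(seed)
    # Assumed schedule: cumulative signal level, high at t=0, low at t=T-1.
    alpha_bar = np.linspace(0.999, 0.01, num_steps)
    x_t = rng.standard_normal(image.shape)  # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        rep = predict_3d(x_t, image, t)  # predicted "3D representation"
        x0_hat = render(rep)             # its 2D render = clean estimate
        if t == 0:
            x_t = x0_hat
            break
        # Noise implied by the current sample and the clean estimate.
        eps_hat = (x_t - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Move to the previous (less noisy) timestep.
        x_t = np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat
    return x_t, rep

image = np.zeros((4, 4, 3))
obs_a, rep_a = sample(image, seed=0)
obs_b, rep_b = sample(image, seed=1)
```

Different seeds start from different noise and therefore yield different samples for the same input image, which is the sense in which a probabilistic method can return multiple reconstructions consistent with one image.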