
DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery (2404.01424v1)

Published 1 Apr 2024 in cs.CV

Abstract: The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound diffusion prior about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent knowledge as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates well-designed guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets. The persuasive results underscore its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes.
