Learning Structure-Guided Diffusion Model for 2D Human Pose Estimation (2306.17074v1)
Abstract: One of the mainstream schemes for 2D human pose estimation (HPE) is learning keypoint heatmaps with a neural network. Existing methods typically improve heatmap quality through customized architectures, such as high-resolution representations and vision Transformers. In this paper, we propose DiffusionPose, a new scheme that formulates 2D HPE as the problem of generating keypoint heatmaps from noised heatmaps. During training, the keypoints are diffused to a random distribution by adding noise, and the diffusion model learns to recover the ground-truth heatmaps from the noised heatmaps, conditioned on image features. During inference, the diffusion model generates heatmaps from initialized heatmaps by progressive denoising. We further explore improving DiffusionPose with conditions derived from human structural information. Extensive experiments demonstrate the effectiveness of DiffusionPose, with improvements of 1.6, 1.2, and 1.2 mAP on the widely used COCO, CrowdPose, and AI Challenge datasets, respectively.
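The training and inference procedure described in the abstract can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the Gaussian heatmap encoding, the linear noise schedule, the DDIM-style deterministic update, and the names `gaussian_heatmap`, `alpha_bar_schedule`, `add_noise`, and `denoise` are all hypothetical. The trained, image-conditioned network is replaced by an oracle that returns the ground-truth heatmap, so the sketch shows only the data flow of noising and progressive denoising, not the model itself.

```python
import numpy as np


def gaussian_heatmap(cx, cy, height, width, sigma=2.0):
    """Render one keypoint at (cx, cy) as a 2D Gaussian heatmap (assumed encoding)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Forward diffusion: sample x_t from q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps


def denoise(xt_init, predict_x0, alpha_bar, step_indices):
    """Progressive denoising with a deterministic DDIM-style update.

    predict_x0(xt, t) stands in for the trained, image-conditioned network.
    step_indices must be in decreasing order of timestep.
    """
    xt = xt_init
    for i, t in enumerate(step_indices):
        x0_hat = predict_x0(xt, t)
        if i + 1 == len(step_indices):
            return x0_hat  # last step: return the clean estimate
        t_next = step_indices[i + 1]
        # Noise implied by the current sample and the clean estimate.
        eps_hat = (xt - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Re-noise the estimate to the (lower) noise level of the next step.
        xt = np.sqrt(alpha_bar[t_next]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_next]) * eps_hat
    return xt


rng = np.random.default_rng(0)
target = gaussian_heatmap(cx=20, cy=12, height=64, width=48)
alpha_bar = alpha_bar_schedule(1000)

# Training-time view: a heavily noised heatmap that the model would learn to invert.
noisy, _ = add_noise(target, t=999, alpha_bar=alpha_bar, rng=rng)

# Inference-time view: start from pure noise and denoise in three steps.
oracle = lambda xt, t: target  # placeholder for the learned denoiser
recovered = denoise(rng.standard_normal(target.shape), oracle, alpha_bar, [999, 499, 0])
```

In a real system the oracle would be a network that takes the noised heatmaps, the timestep, and the image features as input, and the number of sampling steps would trade accuracy against inference cost.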