HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud (2404.03159v1)
Abstract: Extracting keypoint locations from input hand frames, known as 3D hand pose estimation, is a critical task in various human-computer interaction applications. Essentially, the 3D hand pose estimation can be regarded as a 3D point subset generative problem conditioned on input frames. Thanks to the recent significant progress on diffusion-based generative models, hand pose estimation can also benefit from the diffusion model to estimate keypoint locations with high quality. However, directly deploying the existing diffusion models to solve hand pose estimation is non-trivial, since they cannot achieve the complex permutation mapping and precise localization. Based on this motivation, this paper proposes HandDiff, a diffusion-based hand pose estimation model that iteratively denoises accurate hand pose conditioned on hand-shaped image-point clouds. In order to recover keypoint permutation and accurate location, we further introduce joint-wise condition and local detail condition. Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets. Codes and pre-trained models are publicly available at https://github.com/cwc1260/HandDiff.
- Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126, 2021.
- Denoising pretraining for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4175–4186, 2022.
- Dexycb: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9044–9053, 2021.
- Diffusiondet: Diffusion model for object detection. arXiv preprint arXiv:2211.09788, 2022.
- Shpr-net: Deep semantic hand pose regression from point clouds. IEEE Access, 6:43425–43439, 2018.
- Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing, 395:138–149, 2020.
- Efficient virtual view selection for 3d hand pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 419–426, 2022.
- Point auto-encoder and its application to 2d-3d transformation. In International Symposium on Visual Computing, pages 66–78. Springer, 2019.
- Handfoldingnet: A 3d hand pose estimation network using multiscale-feature guided folding of a 2d hand skeleton. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11260–11269, 2021.
- Diffupose: Monocular 3d human pose estimation via denoising diffusion probabilistic model. arXiv preprint arXiv:2212.02796, 2022.
- Crossinfonet: Multi-task information sharing based hand pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9896–9905, 2019.
- Jgr-p2o: Joint graph reasoning based pixel-to-offset prediction network for 3d hand pose estimation from a single depth image. In European Conference on Computer Vision, pages 120–137. Springer, 2020.
- Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3593–3601, 2016.
- 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1991–2000, 2017.
- Hand pointnet: 3d hand pose estimation using point sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8417–8426, 2018a.
- Point-to-point regression pointnet for 3d hand pose estimation. In Proceedings of the European conference on computer vision (ECCV), pages 475–491, 2018b.
- Diffpose: Toward more reliable 3d pose estimation. arXiv preprint arXiv:2211.16940, 2022.
- Region ensemble network: Improving convolutional network for hand pose estimation. In 2017 IEEE International Conference on Image Processing (ICIP), pages 4512–4516. IEEE, 2017.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23(47):1–33, 2022.
- Diffpose: Multi-hypothesis human pose estimation using diffusion models. arXiv preprint arXiv:2211.16487, 2022.
- Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, pages 10362–10383. PMLR, 2022.
- Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11927–11936, 2019.
- End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1954–1963, 2021.
- Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 529–537, 2019.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2837–2845, 2021.
- A conditional point diffusion-refinement paradigm for 3d point cloud completion. arXiv preprint arXiv:2112.03530, 2021.
- V2v-posenet: Voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In Proceedings of the IEEE conference on computer vision and pattern Recognition, pages 5079–5088, 2018.
- Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, pages 4474–4484. PMLR, 2020.
- Deepprior++: Improving fast and accurate 3d hand pose estimation. In Proceedings of the IEEE international conference on computer vision Workshops, pages 585–594, 2017.
- Handoccnet: Occlusion-robust 3d hand mesh estimation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1496–1505, 2022.
- Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017a.
- Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017b.
- Srn: Stacked regression network for real-time 3d hand pose estimation. In BMVC, page 112, 2019.
- Pose-guided hierarchical graph reasoning for 3-d hand pose estimation from a single depth image. IEEE Transactions on Cybernetics, 2021a.
- Spatial-aware stacked regression network for real-time 3d hand pose estimation. Neurocomputing, 437:42–57, 2021b.
- Two heads are better than one: image-point cloud network for depth-based 3d hand pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2163–2171, 2023.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022a.
- Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022b.
- Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. arXiv preprint arXiv:2303.11579, 2023.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Weakly supervised 3d hand pose estimation via biomechanical constraints. In European conference on computer vision, pages 211–228. Springer, 2020.
- Cascaded hand pose regression. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 824–832, 2015.
- Latent regression forest: Structured estimation of 3d articulated hand posture. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3786–3793, 2014.
- Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG), 33(5):1–10, 2014.
- Collaborative learning for hand and object reconstruction with attention-guided graph convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1664–1674, 2022.
- Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022.
- Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.
- Dense 3d regression for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2018.
- A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 793–802, 2019.
- 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.