RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose (2303.07399v2)
Abstract: Recent studies on 2D pose estimation have achieved excellent performance on public benchmarks, yet applying these methods in industry still suffers from heavy model parameters and high latency. To bridge this gap, we empirically explore key factors in pose estimation, including paradigm, model architecture, training strategy, and deployment, and present a high-performance real-time multi-person pose estimation framework, RTMPose, based on MMPose. Our RTMPose-m achieves 75.8% AP on COCO with 90+ FPS on an Intel i7-11700 CPU and 430+ FPS on an NVIDIA GTX 1660 Ti GPU, and RTMPose-l achieves 67.0% AP on COCO-WholeBody with 130+ FPS. To further evaluate RTMPose's capability in critical real-time applications, we also report its performance after deployment on a mobile device: RTMPose-s achieves 72.2% AP on COCO with 70+ FPS on a Snapdragon 865 chip, outperforming existing open-source libraries. Code and models are released at https://github.com/open-mmlab/mmpose/tree/1.x/projects/rtmpose.