RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose (2303.07399v2)

Published 13 Mar 2023 in cs.CV

Abstract: Recent studies on 2D pose estimation have achieved excellent performance on public benchmarks, yet its application in the industrial community still suffers from heavy model parameters and high latency. In order to bridge this gap, we empirically explore key factors in pose estimation including paradigm, model architecture, training strategy, and deployment, and present a high-performance real-time multi-person pose estimation framework, RTMPose, based on MMPose. Our RTMPose-m achieves 75.8% AP on COCO with 90+ FPS on an Intel i7-11700 CPU and 430+ FPS on an NVIDIA GTX 1660 Ti GPU, and RTMPose-l achieves 67.0% AP on COCO-WholeBody with 130+ FPS. To further evaluate RTMPose's capability in critical real-time applications, we also report the performance after deploying on the mobile device. Our RTMPose-s achieves 72.2% AP on COCO with 70+ FPS on a Snapdragon 865 chip, outperforming existing open-source libraries. Code and models are released at https://github.com/open-mmlab/mmpose/tree/1.x/projects/rtmpose.


Summary

  • The paper leverages a top-down approach with RTMDet and a CSPNeXt backbone to overcome detection latency and boost pose estimation accuracy.
  • It adopts SimCC, casting keypoint localization as a classification task that reduces computational cost versus traditional heatmap methods.
  • Extensive tests across CPUs, GPUs, and mobile devices show RTMPose-m exceeding 430 FPS on an NVIDIA GTX 1660 Ti GPU, supporting real-time industrial applications.

RTMPose: Enhancing Real-Time Multi-Person Pose Estimation

The paper "RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose" addresses the challenges and requirements of efficient pose estimation for industrial applications. The authors propose a comprehensive framework named RTMPose, which redefines multi-person 2D pose estimation by optimizing model architecture, training strategies, and deployment processes. This essay explores the core components of the paper and analyzes its implications for real-time applications.

Key Contributions

The research presents RTMPose as an optimization over existing pose estimation methodologies, specifically focusing on bridging the gap between academic benchmarks and industrial performance requirements.

  1. Model Architecture and Paradigm: RTMPose utilizes a top-down approach, known for accuracy but typically hindered by detection latency. By leveraging efficient real-time detectors like RTMDet, the authors effectively eliminate detection as a bottleneck. The implementation of CSPNeXt as the backbone ensures a balance between computational cost and accuracy.
  2. Coordinate Classification with SimCC: The paper adopts SimCC for keypoint localization, recasting it as a classification over discretized horizontal and vertical bins. This reduces computational cost compared to conventional heatmap-based methods and eases deployment across diverse platforms.
  3. Training Enhancements: The researchers systematically refine training strategies, employing a two-stage augmentation technique and strategic optimization processes. These adjustments yield significant performance improvements across the RTMPose models.
  4. Real-Time Deployment: Extensive tests on multiple hardware setups, including CPUs, GPUs, and mobile devices, underline the framework's flexibility and efficiency. RTMPose achieves notable speeds, with RTMPose-m operating beyond 430 FPS on an NVIDIA GTX 1660 Ti, thereby outpacing existing solutions.
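The coordinate-classification idea in item 2 can be sketched as a decoding step: each keypoint's x and y positions are predicted as independent 1-D bin scores (often at sub-pixel resolution via a splitting factor), and the coordinate is recovered by an argmax over each vector. This is a minimal illustration under those assumptions; the function and argument names are hypothetical, not the MMPose API.

```python
import numpy as np

def simcc_decode(x_logits, y_logits, split_ratio=2.0):
    """Decode SimCC-style 1-D classification vectors into (x, y) coordinates.

    x_logits: (K, W * split_ratio) per-keypoint horizontal bin scores
    y_logits: (K, H * split_ratio) per-keypoint vertical bin scores
    Returns keypoint coordinates in input-image pixels and a confidence
    taken as the weaker of the two axis maxima.
    """
    x_bins = np.argmax(x_logits, axis=1)
    y_bins = np.argmax(y_logits, axis=1)
    # Bins are at sub-pixel granularity; divide by the splitting factor.
    coords = np.stack([x_bins, y_bins], axis=1) / split_ratio
    scores = np.minimum(np.max(x_logits, axis=1), np.max(y_logits, axis=1))
    return coords, scores
```

Because decoding is two 1-D argmax operations per keypoint rather than a 2-D heatmap post-processing step, it is cheap and straightforward to export to inference runtimes.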

Numerical Results

RTMPose demonstrates compelling performance across various datasets:

  • COCO: RTMPose-m attained 75.8% AP with 90+ FPS on an Intel i7-11700 CPU, while RTMPose-l reached 67.0% AP on COCO-WholeBody at 130+ FPS. These metrics highlight a strong accuracy-speed trade-off crucial for practical applications.
  • COCO-SinglePerson and CrowdPose: The models outperformed existing open-source alternatives tailored to single-person and crowded scenarios, reinforcing the versatility of RTMPose.
  • Inference Pipeline Efficiency: Deployment results on Snapdragon 865 and Intel i7-11700 hardware showed end-to-end latency low enough for RTMPose to support real-time applications.
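The two-stage top-down flow these results rest on can be sketched as a simple loop: run a person detector, crop each detection, and estimate keypoints per crop. The `detector` and `pose_model` callables below are hypothetical stand-ins, not the actual RTMDet or RTMPose interfaces.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:
    box: tuple   # (x1, y1, x2, y2) in frame pixels
    score: float

def top_down_pose(frame, detector, pose_model, det_thresh=0.5):
    """Two-stage top-down pipeline: detect persons, then estimate pose per crop."""
    poses = []
    for det in detector(frame):
        if det.score < det_thresh:
            continue
        x1, y1, x2, y2 = map(int, det.box)
        crop = frame[y1:y2, x1:x2]          # rows are y, columns are x
        keypoints = pose_model(crop)        # (K, 2) in crop coordinates
        keypoints = keypoints + np.array([x1, y1])  # back to frame coordinates
        poses.append(keypoints)
    return poses
```

The pipeline's total latency is the detector pass plus one pose pass per person, which is why replacing a slow detector with a real-time one such as RTMDet removes the usual top-down bottleneck.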

Practical and Theoretical Implications

The advancements presented in RTMPose make it relevant for applications demanding real-time pose estimation, such as augmented reality, human-computer interaction, and surveillance. The techniques refined here could inspire further research into efficient model architectures for resource-constrained environments.

Future Directions

Further exploration into more advanced architectures, such as transformers, could enhance spatial understanding in pose estimation. Additionally, multi-task learning could be a path to simultaneously addressing related tasks such as activity recognition and semantic segmentation.

In conclusion, the RTMPose framework represents a significant step forward in bridging the divide between theoretical advancements in pose estimation and their practical applicability. Its performance metrics and versatile architecture offer a robust foundation for future innovations aimed at enhancing real-time human pose estimation capabilities.