RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose (2303.07399v2)

Published 13 Mar 2023 in cs.CV

Abstract: Recent studies on 2D pose estimation have achieved excellent performance on public benchmarks, yet its application in the industrial community still suffers from heavy model parameters and high latency. In order to bridge this gap, we empirically explore key factors in pose estimation including paradigm, model architecture, training strategy, and deployment, and present a high-performance real-time multi-person pose estimation framework, RTMPose, based on MMPose. Our RTMPose-m achieves 75.8% AP on COCO with 90+ FPS on an Intel i7-11700 CPU and 430+ FPS on an NVIDIA GTX 1660 Ti GPU, and RTMPose-l achieves 67.0% AP on COCO-WholeBody with 130+ FPS. To further evaluate RTMPose's capability in critical real-time applications, we also report the performance after deploying on the mobile device. Our RTMPose-s achieves 72.2% AP on COCO with 70+ FPS on a Snapdragon 865 chip, outperforming existing open-source libraries. Code and models are released at https://github.com/open-mmlab/mmpose/tree/1.x/projects/rtmpose.


Summary

  • The paper leverages a top-down approach with RTMDet and a CSPNeXt backbone to overcome detection latency and boost pose estimation accuracy.
  • It adopts SimCC, casting keypoint localization as a classification task that reduces computational cost versus traditional heatmap methods.
  • Extensive tests across CPUs, GPUs, and mobile devices show RTMPose-m exceeding 430 FPS on an NVIDIA GTX 1660 Ti GPU, supporting real-time industrial applications.

RTMPose: Enhancing Real-Time Multi-Person Pose Estimation

The paper "RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose" addresses the challenges and requirements of efficient pose estimation for industrial applications. The authors propose a comprehensive framework named RTMPose, which redefines multi-person 2D pose estimation by optimizing model architecture, training strategies, and deployment processes. This essay explores the core components of the paper and analyzes its implications for real-time applications.

Key Contributions

The research presents RTMPose as an optimization over existing pose estimation methodologies, specifically focusing on bridging the gap between academic benchmarks and industrial performance requirements.

  1. Model Architecture and Paradigm: RTMPose utilizes a top-down approach, known for accuracy but typically hindered by detection latency. By leveraging efficient real-time detectors like RTMDet, the authors effectively eliminate detection as a bottleneck. The implementation of CSPNeXt as the backbone ensures a balance between computational cost and accuracy.
  2. Coordinate Classification with SimCC: The paper adopts SimCC for keypoint localization, recasting it as a classification over discretized horizontal and vertical bins. This reduces computational cost compared to conventional heatmap-based methods and eases deployment across diverse platforms.
  3. Training Enhancements: The researchers systematically refine training strategies, employing a two-stage augmentation technique and strategic optimization processes. These adjustments yield significant performance improvements across the RTMPose models.
  4. Real-Time Deployment: Extensive tests on multiple hardware setups, including CPUs, GPUs, and mobile devices, underline the framework's flexibility and efficiency. RTMPose achieves notable speeds, with RTMPose-m operating beyond 430 FPS on an NVIDIA GTX 1660 Ti, thereby outpacing existing solutions.
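The coordinate-classification idea in item 2 can be sketched as a decoding step: each keypoint's x and y positions are predicted as independent 1-D bin scores (often at sub-pixel resolution via a splitting factor), and the coordinate is recovered by an argmax over each vector. This is a minimal illustration under those assumptions; the function and argument names are hypothetical, not the MMPose API.

```python
import numpy as np

def simcc_decode(x_logits, y_logits, split_ratio=2.0):
    """Decode SimCC-style 1-D classification vectors into (x, y) coordinates.

    x_logits: (K, W * split_ratio) per-keypoint horizontal bin scores
    y_logits: (K, H * split_ratio) per-keypoint vertical bin scores
    Returns keypoint coordinates in input-image pixels and a confidence
    taken as the weaker of the two axis maxima.
    """
    x_bins = np.argmax(x_logits, axis=1)
    y_bins = np.argmax(y_logits, axis=1)
    # Bins are at sub-pixel granularity; divide by the splitting factor.
    coords = np.stack([x_bins, y_bins], axis=1) / split_ratio
    scores = np.minimum(np.max(x_logits, axis=1), np.max(y_logits, axis=1))
    return coords, scores
```

Because decoding is two 1-D argmax operations per keypoint rather than a 2-D heatmap post-processing step, it is cheap and straightforward to export to inference runtimes.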

Numerical Results

RTMPose demonstrates compelling performance across various datasets:

  • COCO: RTMPose-m attained 75.8% AP with 90+ FPS on an Intel i7-11700 CPU, while RTMPose-l reached 67.0% AP on COCO-WholeBody at 130+ FPS. These metrics highlight a strong accuracy-speed trade-off crucial for practical applications.
  • COCO-SinglePerson and CrowdPose: The models outperformed existing open-source alternatives tailored to single-person and crowded scenarios, reinforcing the versatility of RTMPose.
  • Inference Pipeline Efficiency: Deployment results on Snapdragon 865 and Intel i7-11700 hardware showed end-to-end latency low enough for RTMPose to support real-time applications.
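The two-stage top-down flow these results rest on can be sketched as a simple loop: run a person detector, crop each detection, and estimate keypoints per crop. The `detector` and `pose_model` callables below are hypothetical stand-ins, not the actual RTMDet or RTMPose interfaces.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:
    box: tuple   # (x1, y1, x2, y2) in frame pixels
    score: float

def top_down_pose(frame, detector, pose_model, det_thresh=0.5):
    """Two-stage top-down pipeline: detect persons, then estimate pose per crop."""
    poses = []
    for det in detector(frame):
        if det.score < det_thresh:
            continue
        x1, y1, x2, y2 = map(int, det.box)
        crop = frame[y1:y2, x1:x2]          # rows are y, columns are x
        keypoints = pose_model(crop)        # (K, 2) in crop coordinates
        keypoints = keypoints + np.array([x1, y1])  # back to frame coordinates
        poses.append(keypoints)
    return poses
```

The pipeline's total latency is the detector pass plus one pose pass per person, which is why replacing a slow detector with a real-time one such as RTMDet removes the usual top-down bottleneck.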

Practical and Theoretical Implications

The advancements presented in RTMPose make it relevant for applications demanding real-time pose estimation, such as augmented reality, human-computer interaction, and surveillance. The techniques refined here could inspire further research into efficient model architectures for resource-constrained environments.

Future Directions

Further exploration into more advanced architectures, such as transformers, could enhance spatial understanding in pose estimation. Additionally, multi-task learning could be a path to simultaneously addressing related tasks such as activity recognition and semantic segmentation.

In conclusion, the RTMPose framework represents a significant step forward in bridging the divide between theoretical advancements in pose estimation and their practical applicability. Its performance metrics and versatile architecture offer a robust foundation for future innovations aimed at enhancing real-time human pose estimation capabilities.