Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose (1811.12004v1)

Published 29 Nov 2018 in cs.CV

Abstract: In this work we adapt multi-person pose estimation architecture to use it on edge devices. We follow the bottom-up approach from OpenPose, the winner of COCO 2016 Keypoints Challenge, because of its decent quality and robustness to number of people inside the frame. With proposed network design and optimized post-processing code the full solution runs at 28 frames per second (fps) on Intel® NUC 6i7KYB mini PC and 26 fps on Core™ i7-6850K CPU. The network model has 4.1M parameters and 9 billions floating-point operations (GFLOPs) complexity, which is just ~15% of the baseline 2-stage OpenPose with almost the same quality. The code and model are available as a part of Intel® OpenVINO™ Toolkit.

Summary

The paper introduces a CPU-optimized OpenPose with a dilated MobileNet v1 backbone that cuts network complexity to 9 GFLOPs, against a cited 136.1 GFLOPs for the original, enabling fast inference.
It streamlines the refinement stages and the post-processing code, raising throughput from 1.54 fps to as much as 28 fps on CPU.
The approach maintains near-baseline accuracy with less than a 1% AP drop, enabling practical deployment on edge devices for surveillance, interaction, and sports analytics.

Real-time 2D Multi-Person Pose Estimation on CPU: A Technical Overview

The paper optimizes the OpenPose architecture for real-time multi-person pose estimation on edge devices, particularly CPUs. Building on the bottom-up approach of the original OpenPose, the proposed method is far cheaper to compute while keeping accuracy close to the baseline, which makes it deployable on hardware with limited processing power, such as mini PCs and desktop CPUs.

Methodological Advancements

The primary contribution of this work is the optimization of both the network design and the post-processing algorithm used in multi-person pose estimation. The network is redesigned around a lightweight architecture so that it runs efficiently on CPUs with little loss of accuracy.

Network Design Optimizations:
- Backbone Adaptation: The paper replaces the VGG-19 backbone of the original OpenPose with a dilated MobileNet v1. Dilated convolutions let the backbone keep a higher feature-map resolution without shrinking the receptive field. The resulting network needs just 9 GFLOPs, which the abstract puts at about 15% of the baseline 2-stage OpenPose. Note that 9 GFLOPs is only about 6.6% of the 136.1 GFLOPs figure cited for the original network, so that larger figure presumably describes a heavier configuration than the 2-stage baseline.
- Refinement Stage Simplification: The authors keep only two stages, an initial stage and a lightweight refinement stage, and use a single prediction branch in place of separate branches for keypoint heatmaps and part affinity fields. This lowers the computational cost further, while retraining with all stages has a regularizing effect that limits the loss in accuracy.
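The effect of dilation on resolution and receptive field can be illustrated with a little bookkeeping. The following is a minimal sketch, not the paper's actual layer configuration: the two-layer toy stacks, the 32-pixel input, and the helper name `trace` are assumptions chosen for illustration. It compares a strided pair of 3x3 convolutions with a variant where the stride is removed and the next layer is dilated instead.

```python
from typing import List, Tuple

# One convolution layer along a single spatial axis:
# (kernel_size, stride, dilation). Hypothetical toy values only.
Layer = Tuple[int, int, int]


def trace(layers: List[Layer], input_size: int) -> Tuple[int, int]:
    """Return (output_size, receptive_field) after applying the layers."""
    size = input_size
    receptive_field = 1
    jump = 1  # distance in input pixels between adjacent output positions
    for kernel, stride, dilation in layers:
        span = dilation * (kernel - 1)  # extent of the dilated kernel minus one
        padding = span // 2             # "same"-style padding for odd kernels
        size = (size + 2 * padding - span - 1) // stride + 1
        receptive_field += span * jump
        jump *= stride
    return size, receptive_field


# Stride 2 followed by an ordinary 3x3 convolution.
strided = [(3, 2, 1), (3, 1, 1)]
# Stride removed; the following convolution uses dilation 2 instead.
dilated = [(3, 1, 1), (3, 1, 2)]

print(trace(strided, 32))  # (16, 7)
print(trace(dilated, 32))  # (32, 7)
```

Both toy stacks see the same 7-pixel receptive field, but the dilated one keeps the full 32-pixel resolution rather than halving it. Higher resolution also means more positions to compute, so this choice trades some cost for spatial detail.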
Post-processing Enhancements:
- The keypoint extraction and grouping phases were profiled and optimized, mainly by removing unnecessary memory allocations and by parallelizing work with OpenCV. This yields a substantial speed-up, from 1.54 fps to 26 fps on a standard CPU.
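To make the keypoint extraction step concrete, the sketch below finds local maxima in a single heatmap. It is a minimal pure-Python illustration under stated assumptions, not the authors' optimized implementation: the function name `extract_peaks`, the 4-neighbour comparison, the 0.1 threshold, and the toy heatmap are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def extract_peaks(heatmap: Heatmap,
                  threshold: float = 0.1) -> List[Tuple[int, int, float]]:
    """Return (x, y, score) for each strict local maximum >= threshold."""
    height = len(heatmap)
    width = len(heatmap[0]) if height else 0
    peaks = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score < threshold:
                continue
            # Out-of-bounds neighbours count as 0.0, so border pixels
            # can still be reported as peaks.
            left = heatmap[y][x - 1] if x > 0 else 0.0
            right = heatmap[y][x + 1] if x < width - 1 else 0.0
            up = heatmap[y - 1][x] if y > 0 else 0.0
            down = heatmap[y + 1][x] if y < height - 1 else 0.0
            if score > max(left, right, up, down):
                peaks.append((x, y, score))
    return peaks


# Toy heatmap for one keypoint type (assumed values).
toy = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.2, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6],
]

print(extract_peaks(toy))  # [(2, 1, 0.9), (4, 3, 0.6)]
```

Each keypoint type has its own heatmap, and the function touches no shared state, so calls for different heatmaps are independent of one another. That independence is the kind of structure that lends itself to parallel execution.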
Inference Strategy:
- Running inference through the Intel OpenVINO Toolkit lets the network execute efficiently on a range of hardware, including GPUs, FPGAs, and, most importantly here, CPUs. The system reaches real-time performance: 28 fps on an Intel NUC 6i7KYB mini PC and 26 fps on a Core i7-6850K CPU.
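The complexity and throughput figures quoted above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the helper names `relative_cost` and `frame_time_ms` are made up for the example, and the "implied baseline" is simply what the abstract's "~15%" statement works out to, not a number reported in this text.

```python
def relative_cost(new_gflops: float, old_gflops: float) -> float:
    """Fraction of the old cost that the new network needs."""
    return new_gflops / old_gflops


def frame_time_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


LIGHT_GFLOPS = 9.0      # from the abstract
QUOTED_GFLOPS = 136.1   # figure cited for the original network
ABSTRACT_RATIO = 0.15   # "~15% of the baseline 2-stage OpenPose"

# Baseline cost implied by the abstract's ratio (an inference, not a quote).
implied_baseline = LIGHT_GFLOPS / ABSTRACT_RATIO
print(f"{implied_baseline:.1f} GFLOPs")                       # 60.0 GFLOPs
print(f"{relative_cost(LIGHT_GFLOPS, QUOTED_GFLOPS):.1%}")    # 6.6%

# Per-frame time budgets for the reported frame rates.
for fps in (1.54, 26.0, 28.0):
    print(f"{fps} fps -> {frame_time_ms(fps):.1f} ms per frame")
```

The gap between 6.6% and 15% suggests that the 136.1 GFLOPs figure and the "2-stage baseline" refer to different configurations. The frame-time view also shows what real time means here: at 26 to 28 fps the whole pipeline, network plus post-processing, has under 40 ms per frame.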

Empirical Results and Performance

The experiments show that the optimized network reaches nearly the same average precision (AP) as the baseline 2-stage OpenPose: the drop is under 1% AP despite the much lower computational cost. The solution also copes efficiently with a challenging video benchmark containing more than 20 poses per frame, which points to its robustness in scenarios that require fast inference.

Practical Implications and Future Directions

This work widens the scope for deploying real-time human pose estimation where powerful hardware such as GPUs is unavailable or impractical. As a result, applications such as surveillance, human-computer interaction, and sports analytics become feasible on mobile or embedded systems.

Future research might explore further optimization through quantization, pruning, and knowledge distillation, which could improve speed and shrink the model further. Adapting the architecture to different input resolutions and to varying numbers of people in the frame could also broaden its usefulness in real-world applications.

Overall, the paper improves the efficiency and applicability of pose estimation on edge computing platforms, and shows a practical route to running robust multi-person pose estimation without specialized hardware.
