Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation (2012.15175v4)

Published 30 Dec 2020 in cs.CV

Abstract: Heatmap regression has become the most prevalent choice for nowadays human pose estimation methods. The ground-truth heatmaps are usually constructed via covering all skeletal keypoints by 2D gaussian kernels. The standard deviations of these kernels are fixed. However, for bottom-up methods, which need to handle a large variance of human scales and labeling ambiguities, the current practice seems unreasonable. To better cope with these problems, we propose the scale-adaptive heatmap regression (SAHR) method, which can adaptively adjust the standard deviation for each keypoint. In this way, SAHR is more tolerant of various human scales and labeling ambiguities. However, SAHR may aggravate the imbalance between fore-background samples, which potentially hurts the improvement of SAHR. Thus, we further introduce the weight-adaptive heatmap regression (WAHR) to help balance the fore-background samples. Extensive experiments show that SAHR together with WAHR largely improves the accuracy of bottom-up human pose estimation. As a result, we finally outperform the state-of-the-art model by +1.5AP and achieve 72.0AP on COCO test-dev2017, which is comparable with the performances of most top-down methods. Source codes are available at https://github.com/greatlog/SWAHR-HumanPose.

Citations (120)


Summary

The paper introduces two novel methods, SAHR and WAHR, that adapt Gaussian kernel scales and loss weights for more precise human pose estimation.
The proposed techniques address scale variation and foreground-background imbalance, reaching 72.0 AP on COCO test-dev2017, which is +1.5 AP over the previous state-of-the-art model and comparable with most top-down methods.
Empirical results demonstrate the model's robustness in crowded scenes, paving the way for more adaptable multi-task computer vision applications.

Rethinking Heatmap Regression for Bottom-up Human Pose Estimation

The paper "Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation" revisits the standard heatmap regression used in bottom-up human pose estimation (HPE), where ground-truth heatmaps are built from Gaussian kernels with a fixed standard deviation. It targets two weaknesses of that practice: large variation in human scale and ambiguity in keypoint labels. The authors propose two complementary methods, Scale-Adaptive Heatmap Regression (SAHR) and Weight-Adaptive Heatmap Regression (WAHR). SAHR adjusts the standard deviation for each keypoint, and WAHR rebalances the loss between foreground and background samples.

Scale-Adaptive Heatmap Regression (SAHR)

SAHR addresses the problem of fixed standard deviations in Gaussian kernels used for constructing ground-truth heatmaps. Given that bottom-up HPE must handle varying human scales and labeling precision, such a one-size-fits-all approach becomes inadequate. The authors propose allowing the model to learn and predict scale maps that adjust the standard deviations dynamically per keypoint. This adjustment not only accommodates more significant variances in human scales but also accounts for labeling ambiguities inherent in manual keypoint annotation.

The implementation of SAHR involves augmenting the heatmap regression with an additional branch that predicts scale maps. These scale maps modify the standard deviations of the Gaussian kernels tailored to respective keypoints, effectively allowing the model to learn spatial and semantic relations suited to each keypoint's variance and uncertainty.
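The difference between a fixed and a scale-adaptive kernel can be illustrated with a small sketch. This is not the authors' implementation: the function names, the base standard deviation SIGMA0, and the use of a single scalar scale per keypoint (rather than a predicted per-pixel scale map) are assumptions made for illustration only.

```python
import math

def gaussian_heatmap(height, width, center, sigma):
    """Render a height x width heatmap with a 2D Gaussian at center=(x, y)."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def scale_adaptive_heatmap(height, width, center, base_sigma, scale):
    """Same kernel, with the standard deviation multiplied by a per-keypoint scale."""
    return gaussian_heatmap(height, width, center, base_sigma * scale)

SIGMA0 = 2.0  # hypothetical base standard deviation, not a value from the paper

fixed = gaussian_heatmap(9, 9, (4, 4), SIGMA0)
wide = scale_adaptive_heatmap(9, 9, (4, 4), SIGMA0, scale=1.5)    # e.g. large or ambiguous keypoint
narrow = scale_adaptive_heatmap(9, 9, (4, 4), SIGMA0, scale=0.5)  # e.g. small, precisely labeled keypoint

# Identity: scaling sigma by s is the same as raising the fixed-sigma heatmap
# to the power 1 / s**2, so the adaptive target can be derived element-wise
# from the fixed one without re-rendering the kernel.
rescaled = [[v ** (1.0 / 1.5 ** 2) for v in row] for row in fixed]
```

A scale above 1 widens the kernel and tolerates less precise labels, while a scale below 1 sharpens it. The element-wise identity in the last lines is one reason such an adjustment is cheap to apply to existing fixed-sigma targets.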

Weight-Adaptive Heatmap Regression (WAHR)

The second method, WAHR, tackles the imbalance problem between foreground and background samples prominent in heatmap regression. Building on the principles of focal loss used in classification tasks, WAHR introduces a weighting mechanism that assigns less weight to well-classified samples and focuses more on difficult samples. This targeted attention assists the model in refining its predictions where there is notable confusion or overlap between person instances, especially in crowded scenes.

Weight adaptation matters in heatmap regression because most pixels in a heatmap are zero, so an unweighted loss is dominated by background and biases the model towards it. By adjusting the loss weight of each sample dynamically, WAHR shifts learning towards the harder keypoint predictions and thereby improves pose estimation accuracy.
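The focal-style weighting described above can be sketched as a per-pixel weighted squared error. The weight formula, the soft foreground indicator h ** gamma, the default gamma, and the function name are all assumptions chosen for illustration; the exact form used in the paper and its released code may differ.

```python
def weight_adaptive_mse(pred, target, gamma=0.01):
    """Focal-style weighted MSE over two equally shaped 2D lists with values in [0, 1].

    Assumed weighting (illustrative, not taken from the authors' code):
      fg     = target ** gamma           soft foreground indicator (0 stays 0, positives go to ~1)
      weight = fg * |1 - pred| + (1 - fg) * |pred|
    Confident, correct pixels therefore receive a small weight.
    """
    total = 0.0
    count = 0
    for p_row, h_row in zip(pred, target):
        for p, h in zip(p_row, h_row):
            fg = h ** gamma
            weight = fg * abs(1.0 - p) + (1.0 - fg) * abs(p)
            total += weight * (p - h) ** 2
            count += 1
    return total / count

# An easy background pixel is strongly down-weighted relative to plain MSE,
# while a badly missed foreground peak keeps almost its full error.
easy_background = weight_adaptive_mse([[0.01]], [[0.0]])
hard_foreground = weight_adaptive_mse([[0.1]], [[1.0]])
```

Under this assumed weighting, the many near-zero background pixels contribute very little to the total, so the gradient is driven mainly by keypoint regions that the model still predicts poorly.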

Empirical Results and Implications

On COCO test-dev2017, the proposed framework achieves 72.0 AP, outperforming the previous state-of-the-art model by +1.5 AP and reaching accuracy comparable with most top-down methods. This is notable because bottom-up methods must handle a large variance of human scales within a single image, which has typically left them behind top-down approaches in accuracy.

The paper also investigates the influence of parameters governing the scale and weight adaptations, revealing robustness across different settings. The results on the CrowdPose dataset, characterized by more crowded scenes, further highlight the effectiveness of this approach, showing a considerable gain over state-of-the-art models.

Conclusions and Future Directions

The proposed methods, SAHR and WAHR, represent advancements in bottom-up HPE through nuanced handling of scale variation and example weighting. They present a path forward for more adaptable, efficient, and robust human pose estimation models that perform well across diverse and challenging scenarios.

Future developments could explore further tuning of scale and weight parameters or expansion into multi-task learning frameworks where pose estimation complements other vision tasks like action recognition or scene understanding. Addressing computational efficiency and deploying these methods in real-time systems could also be areas of continued research and potential impact.


GitHub

GitHub - greatlog/SWAHR-HumanPose: Bottom-up Human Pose Estimation (124 stars)