Adversarial PoseNet: A Structure-aware Convolutional Network for Human Pose Estimation (1705.00389v2)

Published 30 Apr 2017 in cs.CV

Abstract: For human pose estimation in monocular images, joint occlusions and overlapping upon human bodies often result in deviated pose predictions. Under these circumstances, biologically implausible pose predictions may be produced. In contrast, human vision is able to predict poses by exploiting geometric constraints of joint inter-connectivity. To address the problem by incorporating priors about the structure of human bodies, we propose a novel structure-aware convolutional network to implicitly take such priors into account during training of the deep network. Explicit learning of such constraints is typically challenging. Instead, we design discriminators to distinguish the real poses from the fake ones (such as biologically implausible ones). If the pose generator (G) generates results that the discriminator fails to distinguish from real ones, the network successfully learns the priors.

Summary

The paper introduces a GAN-based adversarial framework that enforces biologically plausible human joint structures in pose estimation.
A generator network predicts pose and occlusion heatmaps, while discriminator networks judge whether the predicted poses are plausible; the method achieves a 2.4% mean PCK improvement on the LSP dataset.
The method's strong performance under occlusion highlights its potential applications in human-computer interaction, surveillance, and augmented reality.

An Examination of "Adversarial PoseNet: A Structure-aware Convolutional Network for Human Pose Estimation"

The paper addresses human pose estimation from monocular images, focusing on cases where joints are occluded or body parts overlap. In these cases, predictions often drift from the true pose and can be biologically implausible. The authors introduce a structure-aware convolutional network, inspired by the human visual system's ability to infer poses from the geometric constraints of joint inter-connectivity.

Methodology and Architecture

To incorporate priors about human body structure without explicitly learning geometric constraints, the authors employ an adversarial framework. This framework consists of a generator network and a set of discriminator networks. The generator network (G) is responsible for predicting poses and occlusion heatmaps in a stacked multi-task manner. The discriminators are tasked with distinguishing between real and fake (biologically implausible) poses. The introduction of discriminators aligns with the strategy used in Generative Adversarial Networks (GANs), where the generator is trained to produce output indistinguishable from real data by the discriminator.

The architecture uses a conditional GAN formulation, in which a discriminator is conditioned on additional input rather than judging the generated output in isolation. Adversarial training pushes G toward pose predictions that respect the structural constraints of the human body. This matters most where image evidence alone is unreliable, such as under occlusion or against ambiguous backgrounds, since a prediction that violates body structure can be penalized even when the local appearance is misleading.
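The interplay between the generator and a discriminator can be sketched with generic GAN-style losses. This is a minimal illustration under stated assumptions, not the paper's exact objective: the function names, the use of binary cross-entropy, the flattened heatmap representation, and the `adv_weight` value are all hypothetical choices made for this example.

```python
import math

def bce(p, target, eps=1e-7):
    """Binary cross-entropy of one probability p against a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def mse(pred, truth):
    """Mean squared error between two flattened heatmaps."""
    return sum((a - b) ** 2 for a, b in zip(pred, truth)) / len(pred)

def discriminator_loss(d_real, d_fake):
    """D is rewarded for scoring ground-truth poses near 1 and generated poses near 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(pred_heatmap, true_heatmap, d_fake, adv_weight=0.01):
    """Heatmap regression plus an adversarial term (adv_weight is a hypothetical value).

    The adversarial term is small when D scores the generated pose as real.
    """
    return mse(pred_heatmap, true_heatmap) + adv_weight * bce(d_fake, 1.0)
```

In this sketch, `d_real` and `d_fake` stand for a discriminator's scores on a ground-truth pose and a generated pose, respectively. The generator's loss drops when the discriminator is fooled, which is the mechanism by which structural priors reach G without being written down explicitly.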

Results and Comparisons

The network is evaluated on two benchmark datasets: the extended Leeds Sports Poses (LSP) dataset and the MPII Human Pose dataset. The proposed method is reported to outperform the state-of-the-art methods it is compared against on both. On the LSP dataset, it achieves a mean PCK improvement of 2.4% over the nearest competitor. On the MPII dataset, it improves PCKh@0.5, with the largest gains on wrist and ankle joints, which involve more complex articulation and are more often occluded.
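For readers unfamiliar with the metrics, PCK (Percentage of Correct Keypoints) counts a joint as correct when its predicted location falls within a fraction of a reference length from the ground truth. PCK on LSP is commonly normalized by torso size, while PCKh normalizes by head segment length, so PCKh@0.5 uses half the head length as the threshold. The sketch below is a generic illustration with made-up coordinates, not the benchmarks' official evaluation code; the function name and parameters are assumptions.

```python
import math

def pck(pred, truth, ref_length, alpha=0.5):
    """Fraction of joints predicted within alpha * ref_length of the ground truth.

    pred, truth: lists of (x, y) joint coordinates in the same order.
    ref_length: normalizing length (e.g. torso size for PCK, head size for PCKh).
    """
    threshold = alpha * ref_length
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(pred, truth)
        if math.hypot(px - tx, py - ty) <= threshold
    )
    return hits / len(truth)

# Made-up example: four joints, reference length of 10 pixels, threshold of 5 pixels.
truth = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0), (40.0, 40.0)]
pred = [(11.0, 10.0), (20.0, 26.0), (30.0, 30.0), (49.0, 40.0)]
score = pck(pred, truth, ref_length=10.0, alpha=0.5)
print(score)  # 0.5: errors are 1, 6, 0, and 9 pixels, so two of four joints pass
```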

Implications and Future Directions

The introduction of a structure-aware framework for pose estimation has both theoretical and practical implications. Theoretically, it showcases the efficacy of GANs in weakly supervised learning tasks where structured output is critical. Practically, the robust predictions, even under occluded scenarios, have potential applicability in various domains, including human-computer interaction, surveillance, and augmented reality.

Moving forward, this adversarial approach opens avenues for exploration in other areas requiring structural awareness in predictions, such as facial landmark detection. Additionally, extending the framework to accommodate temporal information in video sequences could lead to further advancements in understanding dynamic human actions.

Overall, "Adversarial PoseNet" marks a significant stride toward achieving higher fidelity in human pose estimation by leveraging the adversarial learning paradigm to impose implicit structural constraints. This work could inspire future research to explore similar adversarial strategies in other structure-dependent prediction tasks within the scope of computer vision and beyond.