AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild
Human pose estimation remains a challenging computer vision problem due to occlusion, background clutter, and variation in human appearance. The paper "AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild" addresses occlusion through adaptive multiview fusion, without relying on body-worn sensors such as IMUs. The authors present AdaFuse, a method that enhances features in occluded views using information from visible views, improving the accuracy and robustness of pose estimation in unconstrained environments.
The primary innovation of AdaFuse lies in how it establishes point-to-point correspondence between camera views by exploiting the sparsity of heatmap representations. Because the fusion module is integrated directly with the pose estimation network, it can be applied to unfamiliar camera configurations without retraining. This adaptability contrasts with many existing state-of-the-art techniques, which must be reconfigured for each new environment.
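The geometry underlying this cross-view correspondence can be illustrated with a small sketch. The helper below (the function name and toy cameras are hypothetical; the paper operates on whole heatmaps rather than single pixels) computes the epipolar line that a pixel in one view induces in another, which is where a multiview method like AdaFuse searches for the corresponding heatmap response:

```python
import numpy as np

def epipolar_line(P1, P2, x1):
    """Epipolar line (homogeneous 3-vector) in view 2 induced by pixel x1
    (homogeneous 3-vector) in view 1, given 3x4 projection matrices P1, P2.

    We backproject x1 onto its viewing ray, project view 1's camera centre
    and one ray point into view 2, and join the two projections with a
    cross product. Restricting fusion to pixels on this line is what makes
    the cross-view correspondence search tractable.
    """
    # Camera centre of view 1: the homogeneous null vector of P1.
    _, _, vt = np.linalg.svd(P1)
    C1 = vt[-1]
    # One point on the viewing ray of x1 (pseudo-inverse backprojection).
    X = np.linalg.pinv(P1) @ x1
    # Epipolar line = join of the two projections in view 2.
    return np.cross(P2 @ C1, P2 @ X)

# Two toy cameras: identity intrinsics, view 2 shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X = np.array([0.5, 0.2, 2.0, 1.0])   # a 3D point
x1, x2 = P1 @ X, P2 @ X              # its projections in the two views
l = epipolar_line(P1, P2, x1)
# x2 must lie on l: homogeneous incidence l . x2 == 0.
print(abs(np.dot(l, x2)))  # ~0
```

The incidence check at the end confirms that the projection of the same 3D point in the second view falls on the computed line, so a sparse heatmap peak in one view only needs to be matched against responses along a single line in the other.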
The researchers evaluate AdaFuse on three prominent datasets: Human3.6M, Total Capture, and CMU Panoptic, outperforming prior methods on all three. They also introduce Occlusion-Person, a synthetic dataset with human-object occlusion labels that enables extensive numerical evaluation under occlusion.
AdaFuse also introduces an adaptive fusion mechanism that learns per-view fusion weights based on the quality of each view's features, which is particularly effective at reducing the impact of corrupted features from 'bad' or low-quality views. The fusion model is trained jointly with the pose estimation network, allowing it to exploit cross-view correspondence effectively.
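A heavily simplified sketch of this weighting idea follows. Everything here is an assumption for illustration: the paper predicts view quality with a learned network trained jointly with the pose estimator, whereas below a hand-supplied scalar score per view stands in for that prediction:

```python
import numpy as np

def adaptive_fuse(heatmaps, quality_scores):
    """Weighted fusion of per-view heatmaps for a single joint.

    heatmaps: list of HxW arrays, one per camera view (assumed already
    warped into a common view, e.g. via epipolar geometry).
    quality_scores: one scalar per view; a softmax turns them into
    fusion weights so corrupted, low-quality views are down-weighted.
    """
    s = np.asarray(quality_scores, dtype=float)
    w = np.exp(s - s.max())
    w /= w.sum()
    # Weighted sum over the view axis: sum_i w[i] * heatmaps[i].
    fused = np.tensordot(w, np.stack(heatmaps), axes=1)
    return fused, w

# Toy example: view 0 sees the joint clearly, view 1 is occluded.
good = np.zeros((4, 4)); good[1, 2] = 1.0   # sharp peak at (1, 2)
bad = np.full((4, 4), 0.25)                 # diffuse, uninformative
fused, w = adaptive_fuse([good, bad], quality_scores=[2.0, -2.0])
print(w)                                    # view 0 dominates
print(np.unravel_index(fused.argmax(), fused.shape))  # peak stays at (1, 2)
```

The point of the design is that the occluded view contributes little, so its noise cannot drag the fused peak away from the location supported by the clean view.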
The numerical results are substantial. On the Human3.6M dataset, AdaFuse reduces the mean 3D pose estimation error from 22.9mm for a strong baseline (NoFuse) to 19.5mm, a notable improvement given the already competitive baseline. The approach also shows marked gains on the synthetic Occlusion-Person dataset, where occlusion rates are high.
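As a quick sanity check, those reported numbers correspond to roughly a 15% relative error reduction:

```python
# Mean 3D errors reported on Human3.6M (baseline NoFuse vs. AdaFuse).
baseline_mm, adafuse_mm = 22.9, 19.5
relative_reduction = (baseline_mm - adafuse_mm) / baseline_mm
print(f"{relative_reduction:.1%}")  # 14.8%
```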
In practical terms, the implications of AdaFuse are significant. The technique benefits established applications such as augmented and virtual reality, and it opens possibilities for advanced human-computer interaction and intelligent player analysis in sports, where occlusion hampers clean data acquisition.
On the theoretical side, AdaFuse's adaptively weighted feature fusion is beneficial not only for handling occlusion but also for potentially advancing model-free estimation methods in unconstrained environments.
Future research could explore the integration of temporal information to further bolster AdaFuse's performance, enabling real-time applications that require the model to keep pace with ongoing action. Overall, AdaFuse marks a confident stride toward more dependable pose estimation in challenging real-world scenarios.