RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation (1902.09868v2)

Published 26 Feb 2019 in cs.CV

Abstract: This paper addresses the problem of 3D human pose estimation from single images. While for a long time human skeletons were parameterized and fitted to the observation by satisfying a reprojection error, nowadays researchers directly use neural networks to infer the 3D pose from the observations. However, most of these approaches ignore the fact that a reprojection constraint has to be satisfied and are sensitive to overfitting. We tackle the overfitting problem by ignoring 2D to 3D correspondences. This efficiently avoids a simple memorization of the training data and allows for a weakly supervised training. One part of the proposed reprojection network (RepNet) learns a mapping from a distribution of 2D poses to a distribution of 3D poses using an adversarial training approach. Another part of the network estimates the camera. This allows for the definition of a network layer that performs the reprojection of the estimated 3D pose back to 2D which results in a reprojection loss function. Our experiments show that RepNet generalizes well to unknown data and outperforms state-of-the-art methods when applied to unseen data. Moreover, our implementation runs in real-time on a standard desktop PC.

Authors (2)

Bastian Wandt (30 papers)
Bodo Rosenhahn (96 papers)

Citations (226)


Summary

The paper presents a novel weakly supervised framework using adversarial training and reprojection loss to robustly infer 3D human poses.
It integrates a camera estimation module and a kinematic chain space to enhance generalization and reduce symmetry errors under noisy conditions.
Experiments on Human3.6M and MPI-INF-3DHP show that the method generalizes well and outperforms several fully supervised methods when applied to unseen data.

Overview of RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation

The paper "RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation," authored by Bastian Wandt and Bodo Rosenhahn, introduces an approach for 3D human pose estimation from single images. The authors address the challenges of direct 3D inference from 2D observations with a weakly supervised training regime that does not require explicit 2D-3D correspondences. Because the network never sees paired 2D and 3D poses, it cannot simply memorize the training data, which mitigates the overfitting commonly encountered in existing methods.

Key Methodological Contributions

The paper uses adversarial training to learn a mapping from a distribution of 2D poses to a distribution of 3D poses. The system, termed RepNet, comprises three components: a 3D pose estimation branch, a camera estimation network, and a reprojection layer. The reprojection constraint, a longstanding principle in geometric vision, is embedded in the network architecture: the estimated 3D pose is projected back to 2D with the estimated camera and compared against the input 2D pose. The resulting reprojection loss regularizes the network's predictions and improves the robustness and generalization of the model.
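The reprojection step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a weak-perspective camera represented as a 2x3 matrix, a 3D pose stored as a 3xN matrix of joint coordinates, and a Frobenius-norm loss. The function names (`reproject`, `reprojection_loss`, `camera_loss`) are hypothetical, and the camera regularizer is one plausible way to encourage a valid weak-perspective camera rather than a confirmed detail of the method.

```python
import numpy as np


def reproject(pose_3d: np.ndarray, camera: np.ndarray) -> np.ndarray:
    """Project a (3, N) pose to 2D with a (2, 3) weak-perspective camera."""
    return camera @ pose_3d


def reprojection_loss(pose_2d: np.ndarray, pose_3d: np.ndarray,
                      camera: np.ndarray) -> float:
    """Frobenius norm between the observed 2D pose and the reprojected 3D pose."""
    return float(np.linalg.norm(pose_2d - reproject(pose_3d, camera)))


def camera_loss(camera: np.ndarray) -> float:
    """Assumed regularizer: penalize cameras whose rows are not orthogonal
    with equal norm, i.e. K K^T should be a scaled 2x2 identity."""
    kkt = camera @ camera.T
    return float(np.linalg.norm(2.0 / np.trace(kkt) * kkt - np.eye(2)))


# Toy example: 17 random joints seen by an orthographic camera.
rng = np.random.default_rng(0)
pose_3d = rng.normal(size=(3, 17))
camera = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
pose_2d = reproject(pose_3d, camera)

loss_exact = reprojection_loss(pose_2d, pose_3d, camera)        # camera matches
loss_wrong_scale = reprojection_loss(pose_2d, pose_3d, 2.0 * camera)
cam_penalty_scaled = camera_loss(2.0 * camera)                  # still valid
cam_penalty_skewed = camera_loss(camera + np.array([[0.0, 0.5, 0.0],
                                                    [0.0, 0.0, 0.0]]))
```

In a trained network both `pose_3d` and `camera` would be predicted from the 2D input, so the loss ties the two branches together without ever requiring a ground-truth 3D pose for that image.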

RepNet uses an adversarial setup in which a generator network, tasked with pose estimation, is trained together with a critic network. The critic, akin to a discriminator in GAN frameworks, is enriched with features derived from the kinematic chain space (KCS), which captures joint angles and bone lengths and thereby helps the critic recognize plausible human poses.
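To make the KCS idea concrete, the sketch below computes a KCS-style matrix for a toy skeleton. It assumes the common formulation in which bone vectors are obtained as B = X C, with X a 3xN pose and C a joint-to-bone mapping matrix, and the KCS matrix is B^T B: its diagonal holds squared bone lengths and its off-diagonal entries hold inner products between bones, which encode the angles. The helper names and the three-joint skeleton are hypothetical and chosen only for illustration.

```python
import numpy as np


def bone_matrix(num_joints: int, bones: list[tuple[int, int]]) -> np.ndarray:
    """Build C of shape (num_joints, num_bones).

    Column k has -1 at the parent joint and +1 at the child joint,
    so that pose @ C yields one bone vector (child - parent) per column.
    """
    c = np.zeros((num_joints, len(bones)))
    for k, (parent, child) in enumerate(bones):
        c[parent, k] = -1.0
        c[child, k] = 1.0
    return c


def kcs_matrix(pose_3d: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Return B^T B for bone vectors B = pose_3d @ C."""
    bones = pose_3d @ c          # (3, num_bones)
    return bones.T @ bones       # (num_bones, num_bones)


# Toy chain of three joints: (0,0,0) -> (1,0,0) -> (1,2,0).
# Rows are x, y, z; columns are joints.
pose = np.array([[0.0, 1.0, 1.0],
                 [0.0, 0.0, 2.0],
                 [0.0, 0.0, 0.0]])
C = bone_matrix(3, [(0, 1), (1, 2)])
psi = kcs_matrix(pose, C)

bone_lengths = np.sqrt(np.diag(psi))   # lengths 1 and 2
cos_angle = psi[0, 1] / (bone_lengths[0] * bone_lengths[1])  # 0: right angle
```

Because these quantities are invariant to global rotation and translation of the pose, feeding them to the critic gives it direct access to properties such as left/right bone-length symmetry.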

Experimental Analysis

The method is evaluated on two standard datasets, Human3.6M and MPI-INF-3DHP. Despite the lack of full supervision, RepNet outperforms several fully supervised counterparts, particularly when tested on unseen data. According to the authors, the implementation runs in real time on a standard desktop PC, which makes it suitable for time-sensitive contexts such as interactive systems.

Furthermore, the integration of kinematic constraints via the KCS layer is shown to enhance the network's ability to generate plausible human poses, significantly reducing symmetry errors even in noisy conditions, as demonstrated in detailed experimental scenarios.

Implications and Future Directions

The adaptation of adversarial training paradigms to 3D pose estimation in a weak supervision context highlights substantial implications for advancing the robustness of vision systems dealing with incomplete training data. The model's ability to function without explicit 2D-3D mappings suggests a pathway towards scalable systems that can learn effectively across diverse and unstructured datasets.

Future research might extend RepNet's framework to dynamic sequences, possibly integrating temporal coherence constraints to tackle more complex motion capture tasks. More sophisticated camera models or multi-view configurations could further broaden its applicability.

In summary, this work shows that 3D pose estimation can be trained without 2D-3D correspondences by combining adversarial training with a reprojection loss, an approach relevant to both academic research and practical applications.
