Generating Multiple Hypotheses for 3D Human Pose Estimation with Mixture Density Network (1904.05547v1)

Published 11 Apr 2019 in cs.CV

Abstract: 3D human pose estimation from a monocular image or 2D joints is an ill-posed problem because of depth ambiguity and occluded joints. We argue that 3D human pose estimation from a monocular input is an inverse problem where multiple feasible solutions can exist. In this paper, we propose a novel approach to generate multiple feasible hypotheses of the 3D pose from 2D joints. In contrast to existing deep learning approaches which minimize a mean square error based on a unimodal Gaussian distribution, our method is able to generate multiple feasible hypotheses of 3D pose based on a multimodal mixture density network. Our experiments show that the 3D poses estimated by our approach from an input of 2D joints are consistent in 2D reprojections, which supports our argument that multiple solutions exist for the 2D-to-3D inverse problem. Furthermore, we show state-of-the-art performance on the Human3.6M dataset in both best hypothesis and multi-view settings, and we demonstrate the generalization capacity of our model by testing on the MPII and MPI-INF-3DHP datasets. Our code is available at the project website.

Citations (191)


Summary

The paper proposes a mixture density network (MDN) framework that generates multiple 3D pose hypotheses from 2D joints to address depth and occlusion ambiguities.
It employs a multimodal Gaussian mixture model with five kernels, and its best hypothesis achieves an average MPJPE of 52.7 mm on the Human3.6M dataset.
The approach demonstrates robustness under occlusion and generalizes well across datasets including MPII and MPI-INF-3DHP.

Insights into Generating Multiple Hypotheses for 3D Human Pose Estimation

The paper "Generating Multiple Hypotheses for 3D Human Pose Estimation with Mixture Density Network" by Chen Li and Gim Hee Lee addresses the inherent ambiguity in 3D human pose estimation from 2D joint data. It uses a mixture density network (MDN) to generate multiple hypotheses of the 3D pose rather than a single estimate, which directly targets the ill-posed nature of the problem caused by depth ambiguity and occluded joints.

Background and Methodology

Traditional deep learning methods for 3D pose estimation typically minimize a mean squared error between the estimated pose and the ground truth, which implicitly assumes a unimodal Gaussian distribution over poses. This overlooks the inverse nature of the problem: multiple feasible 3D poses can correspond to the same 2D projection. The authors instead adopt an MDN, which uses a multimodal Gaussian mixture model to represent this conditional uncertainty.
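The contrast between the two modeling assumptions can be written out as a short sketch. The symbols are assumptions chosen for illustration and are not taken from the paper: x denotes the 2D joints, y the 3D pose, M the number of kernels, and each kernel is assumed to be an isotropic Gaussian.

```latex
% Conditional density of the 3D pose y given the 2D joints x, with M kernels
p(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{M} \alpha_i(\mathbf{x})\,
  \mathcal{N}\!\left(\mathbf{y};\ \boldsymbol{\mu}_i(\mathbf{x}),\ \sigma_i^2(\mathbf{x})\,\mathbf{I}\right),
\qquad \sum_{i=1}^{M} \alpha_i(\mathbf{x}) = 1,\quad \alpha_i(\mathbf{x}) \ge 0

% Special case M = 1 with a fixed \sigma: the negative log-likelihood
% reduces to a squared error up to an additive constant
-\log p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{2\sigma^2}\,
  \lVert \mathbf{y} - \boldsymbol{\mu}(\mathbf{x}) \rVert^2 + \text{const}
```

With M = 1 the model can only place its mass around a single pose, so when several 3D poses explain the same 2D input, a squared-error fit tends toward their average. With M > 1, each kernel can settle on a different feasible pose.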

The MDN framework comprises a feature extractor for lifting 2D joints into higher-dimensional feature spaces and a hypotheses generator that leverages these features to output multiple 3D pose estimations. By default, five Gaussian kernels are utilized, and their mixing coefficients, means, and variances provide the parameters of the resulting 3D pose hypotheses.
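To make the role of the mixture parameters concrete, the following is a minimal NumPy sketch of how a raw network output could be split into mixing coefficients, means, and standard deviations, and how a mixture negative log-likelihood could be evaluated. Several details are assumptions made for illustration rather than the authors' implementation: the 17-joint skeleton, the layout of the raw vector, the softmax and exponential activations, isotropic kernels, and the function names `split_mdn_output` and `mdn_nll`. Any additional regularization used in the paper is not reproduced.

```python
import numpy as np

NUM_JOINTS = 17               # assumption: 17-joint skeleton
NUM_KERNELS = 5               # five Gaussian kernels, as stated in the text
POSE_DIM = NUM_JOINTS * 3     # flattened (x, y, z) per joint


def split_mdn_output(raw):
    """Split a raw vector of length M*(D+2) into (pi, mu, sigma).

    Assumed layout: [M*D means | M log-sigmas | M mixing logits].
    """
    m, d = NUM_KERNELS, POSE_DIM
    mu = raw[: m * d].reshape(m, d)
    sigma = np.exp(raw[m * d : m * d + m])       # exp keeps sigma positive
    logits = raw[m * d + m :]
    exp_logits = np.exp(logits - logits.max())   # numerically stable softmax
    pi = exp_logits / exp_logits.sum()
    return pi, mu, sigma


def mdn_nll(pi, mu, sigma, target):
    """Negative log-likelihood of one target pose under an isotropic mixture."""
    d = target.shape[0]
    sq_dist = ((target - mu) ** 2).sum(axis=1)   # squared distance to each mean
    log_gauss = (-0.5 * sq_dist / sigma**2
                 - d * np.log(sigma)
                 - 0.5 * d * np.log(2.0 * np.pi))
    log_mix = np.log(pi) + log_gauss
    top = log_mix.max()                          # log-sum-exp for stability
    return float(-(top + np.log(np.exp(log_mix - top).sum())))


raw = np.zeros(NUM_KERNELS * (POSE_DIM + 2))     # stand-in for a network output
pi, mu, sigma = split_mdn_output(raw)
loss = mdn_nll(pi, mu, sigma, np.zeros(POSE_DIM))
```

In this sketch each row of `mu`, reshaped to `(NUM_JOINTS, 3)`, would be read as one 3D pose hypothesis, with `pi` indicating how much weight the model assigns to it.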

Numerical and Hypothesis Validation

The paper presents experimental results on the Human3.6M dataset, showing that the approach is competitive in single-view settings and outperforms previous state-of-the-art methods in multi-view scenarios. The model's best hypothesis achieves an average MPJPE (mean per-joint position error) of 52.7 mm, an improvement over methods such as those of Martinez et al. and Lee et al.
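For readers unfamiliar with the metric, here is a small NumPy sketch of MPJPE and of a best-hypothesis selection. The function names and array shapes are assumptions for illustration. Note that picking the best hypothesis this way requires the ground truth, so the resulting number is best read as an oracle-style score rather than the error of a single deployed prediction.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error for poses of shape (J, 3), e.g. in mm."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def best_hypothesis(hypotheses, gt):
    """Return (index, error) of the hypothesis with the lowest MPJPE.

    hypotheses: (M, J, 3) array of candidate poses; gt: (J, 3) ground truth.
    """
    errors = np.linalg.norm(hypotheses - gt[None], axis=-1).mean(axis=-1)
    idx = int(errors.argmin())
    return idx, float(errors[idx])


# Toy data: hypothesis k is the ground truth shifted by (k + 1) mm along x.
gt = np.zeros((17, 3))
hyps = np.stack([gt + np.array([k + 1.0, 0.0, 0.0]) for k in range(5)])
idx, err = best_hypothesis(hyps, gt)
```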

Furthermore, the authors validate their hypotheses by checking that the 3D poses are consistent when reprojected into 2D. The projection of each hypothesis aligns closely with the 2D input joints, which supports the argument that multiple feasible 3D solutions exist for a single 2D input.
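A reprojection check of this kind can be sketched with a simple pinhole camera. The camera model, the absence of lens distortion, the intrinsics, and the function names below are assumptions for illustration and may differ from the evaluation in the paper. The toy example uses the simplest depth ambiguity, a pose scaled along the viewing rays, to show two distinct 3D poses that project to the same 2D joints.

```python
import numpy as np


def project(poses_3d, focal, center):
    """Pinhole projection of camera-frame points (..., J, 3) with z > 0."""
    xy = poses_3d[..., :2] / poses_3d[..., 2:3]
    return xy * focal + center


def reprojection_error(hypotheses, joints_2d, focal, center):
    """Mean 2D distance (pixels) between each projected hypothesis and the input.

    hypotheses: (M, J, 3); joints_2d: (J, 2). Returns an array of shape (M,).
    """
    projected = project(hypotheses, focal, center)
    return np.linalg.norm(projected - joints_2d, axis=-1).mean(axis=-1)


rng = np.random.default_rng(0)
pose = np.concatenate(
    [rng.uniform(-1.0, 1.0, size=(17, 2)),      # x, y in metres
     rng.uniform(3.0, 5.0, size=(17, 1))],      # depth kept in front of camera
    axis=1,
)
focal = np.array([1000.0, 1000.0])              # hypothetical intrinsics
center = np.array([500.0, 500.0])
joints_2d = project(pose, focal, center)

# Two different 3D poses: the original and a copy twice as far from the camera.
hyps = np.stack([pose, 2.0 * pose])
errors = reprojection_error(hyps, joints_2d, focal, center)
```

Both entries of `errors` are zero up to floating-point precision even though the two 3D poses differ, which is the kind of 2D consistency the reprojection test looks for.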

Robustness and Generalization

In addition to the standard evaluations, the authors test the model under occlusion by simulating scenarios in which one or two joints are missing from the 2D input. The hypotheses generated under these conditions remain resilient to the missing joints.
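One way such an occlusion test could be simulated is sketched below. How missing joints are encoded is an assumption here (their coordinates are set to zero), as is the helper name `occlude_joints`; the paper's exact protocol may differ.

```python
import numpy as np


def occlude_joints(joints_2d, num_missing, rng):
    """Return a copy of (J, 2) joints with `num_missing` random joints zeroed.

    Also returns a boolean visibility mask of shape (J,).
    Assumption: a missing joint is encoded as (0, 0).
    """
    occluded = joints_2d.copy()
    missing = rng.choice(joints_2d.shape[0], size=num_missing, replace=False)
    occluded[missing] = 0.0
    visible = np.ones(joints_2d.shape[0], dtype=bool)
    visible[missing] = False
    return occluded, visible


joints = np.ones((17, 2))                       # toy 2D input
occluded, visible = occlude_joints(joints, 2, np.random.default_rng(0))
```

The occluded array would then be fed to the model in place of the full input, and the resulting hypotheses evaluated against the unchanged ground truth.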

Generalization beyond the Human3.6M dataset is assessed on the MPII and MPI-INF-3DHP datasets. The paper shows that the MDN approach achieves competitive performance without retraining, highlighting its ability to generalize across domains, including indoor and outdoor environments.

Implications and Future Directions

This paper presents a methodological advance for addressing the intrinsic ambiguity in monocular 3D human pose estimation. Using an MDN for hypothesis generation offers a promising way to overcome the limitations of single-prediction models in ambiguous scenarios. Given the cross-dataset generalization demonstrated here, future work could explore integrating MDNs with real-time systems and adaptive learning in unconstrained environments.

A further extension would be to incorporate temporal information, adding a temporal coherence constraint to the MDN framework to improve predictions for dynamic activities in video sequences.

In conclusion, the approach proposed by Chen Li and Gim Hee Lee shows that generating multiple hypotheses is an effective way to handle ambiguity in 3D human pose estimation, with potential practical applications such as surveillance and a theoretical basis for further work on multimodal estimation.
