3D Human Pose Estimation from a Single Image via Distance Matrix Regression (1611.09010v1)

Published 28 Nov 2016 in cs.CV

Abstract: This paper addresses the problem of 3D human pose estimation from a single image. We follow a standard two-step pipeline by first detecting the 2D position of the $N$ body joints, and then using these observations to infer 3D pose. For the first step, we use a recent CNN-based detector. For the second step, most existing approaches perform 2$N$-to-3$N$ regression of the Cartesian joint coordinates. We show that more precise pose estimates can be obtained by representing both the 2D and 3D human poses using $N\times N$ distance matrices, and formulating the problem as a 2D-to-3D distance matrix regression. For learning such a regressor we leverage on simple Neural Network architectures, which by construction, enforce positivity and symmetry of the predicted matrices. The approach has also the advantage to naturally handle missing observations and allowing to hypothesize the position of non-observed joints. Quantitative results on Humaneva and Human3.6M datasets demonstrate consistent performance gains over state-of-the-art. Qualitative evaluation on the images in-the-wild of the LSP dataset, using the regressor learned on Human3.6M, reveals very promising generalization results.

Citations (384)

Summary

The paper introduces a regression framework that maps 2D Euclidean Distance Matrices (EDMs) to 3D EDMs, reducing the ambiguity of inferring a 3D pose from a single image.
It employs both Fully Connected and Fully Convolutional networks to efficiently encode structural relationships and handle occlusions.
Quantitative evaluations on the HumanEva and Human3.6M datasets show consistent error reductions over the prior state of the art, and a qualitative evaluation on the Leeds Sports Pose (LSP) dataset suggests promising generalization.

3D Human Pose Estimation from a Single Image via Distance Matrix Regression

The paper "3D Human Pose Estimation from a Single Image via Distance Matrix Regression" addresses the problem of estimating 3D human pose from a single RGB image. The method follows a standard two-step pipeline: a Convolutional Neural Network (CNN) first detects the 2D positions of the body joints, and these detections are then used to infer the 3D pose. The main contribution is in the second step. Instead of regressing vectors of Cartesian joint coordinates, the paper represents both the 2D and the 3D pose as $N\times N$ Euclidean Distance Matrices (EDMs), whose entries are the pairwise distances between the $N$ joints, and regresses one matrix from the other. According to the paper, this formulation reduces the ambiguity inherent in recovering a 3D pose from its projection onto the 2D image plane.

Methodology

The researchers propose a regression framework between two EDMs, employing Neural Network architectures to predict 3D EDMs from 2D EDMs. This shift in representation aligns well with the structural properties of human poses, bringing several advantages:

Structural Encoding: EDMs naturally capture the correlations and dependencies between joints, eliminating the need for explicit constraint formulations required in Cartesian-based representations.
Invariance Properties: EDMs are coordinate-free. They are invariant to in-plane rotations and translations and, after normalization, to scaling.
Handling Occlusions: The approach accommodates missing observations and can hypothesize non-observed joints efficiently.
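
The properties listed above can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the joint count of 14, the normalization by the largest entry, and the encoding of missing joints as zeroed rows and columns are all assumptions made for illustration.

```python
import numpy as np

def edm(joints):
    """Return the N x N matrix of pairwise Euclidean distances for (N, d) joints."""
    diff = joints[:, None, :] - joints[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def normalized_edm(joints):
    """EDM divided by its largest entry (an assumed scale normalization)."""
    d = edm(joints)
    return d / d.max()

def mask_missing(d, missing):
    """Zero the rows and columns of unobserved joints (an assumed encoding)."""
    out = d.copy()
    out[missing, :] = 0.0
    out[:, missing] = 0.0
    return out

rng = np.random.default_rng(0)
pose_2d = rng.normal(size=(14, 2))  # 14 hypothetical 2D joint detections

# Apply an in-plane rotation, a uniform scaling, and a translation.
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
moved = 2.5 * (pose_2d @ rot.T) + np.array([10.0, -3.0])

# Rotation and translation leave the EDM unchanged; scaling multiplies it.
same_up_to_scale = np.allclose(edm(moved), 2.5 * edm(pose_2d))
# After normalization the scale factor disappears as well.
same_after_norm = np.allclose(normalized_edm(moved), normalized_edm(pose_2d))

# Simulate two occluded joints (indices chosen arbitrarily).
occluded = mask_missing(edm(pose_2d), missing=[3, 7])
```

Both `same_up_to_scale` and `same_after_norm` evaluate to `True`, which is the invariance the list describes. The masked matrix keeps its $N\times N$ shape, so a regressor with a fixed input size can still consume it.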

The learning paradigm leverages simple Neural Network structures to map 2D to 3D EDMs. Two architectural configurations are explored: Fully Connected (FConn) Networks and Fully Convolutional (FConv) Networks. Given the reduced dimensionality of the EDMs, these networks are shallow yet effective in encoding the mapping function, with the FConv network also proving adept at extrapolating occluded body parts.
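
One way a network can enforce symmetry and non-negativity by construction is to predict only the upper-triangular entries through a non-negative activation and mirror them. The sketch below illustrates that idea with a tiny fully connected forward pass in NumPy. The layer sizes, the ReLU output, the random untrained weights, and the helper names `init_params` and `predict_3d_edm` are assumptions for illustration and do not reproduce the paper's FConn or FConv architectures.

```python
import numpy as np

N = 14                          # hypothetical number of joints
iu = np.triu_indices(N, k=1)    # indices of the strict upper triangle
n_pairs = len(iu[0])            # N * (N - 1) / 2 joint pairs

def init_params(rng, hidden=64):
    """Random weights for a one-hidden-layer network (untrained, for illustration)."""
    return {
        "W1": rng.normal(scale=0.1, size=(n_pairs, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, n_pairs)),
        "b2": np.zeros(n_pairs),
    }

def predict_3d_edm(edm_2d, params):
    """Map a 2D EDM to a symmetric, non-negative N x N matrix."""
    x = edm_2d[iu]                                          # vectorize upper triangle
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)    # hidden ReLU layer
    y = np.maximum(h @ params["W2"] + params["b2"], 0.0)    # ReLU keeps outputs >= 0
    out = np.zeros((N, N))
    out[iu] = y                                             # fill the upper triangle
    return out + out.T                                      # mirror to enforce symmetry

rng = np.random.default_rng(1)
pose_2d = rng.normal(size=(N, 2))
edm_2d = np.linalg.norm(pose_2d[:, None, :] - pose_2d[None, :, :], axis=-1)

pred = predict_3d_edm(edm_2d, init_params(rng))
```

Whatever the weights are, `pred` is symmetric with a zero diagonal and no negative entries, so training only has to learn the distances themselves. Note that these constraints alone do not guarantee that `pred` is a valid EDM of some 3D point set.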

Empirical Evaluation

Quantitative results show consistent error reductions over prior methods on the HumanEva and Human3.6M datasets. The approach is also reported to remain accurate when the 2D joint detections deviate substantially from their true positions, which matters in practice because the 3D estimate depends on the output of the 2D detector. Results are given under several evaluation protocols that differ in their training and testing conditions, and the method remains competitive across them.

The paper also includes a qualitative assessment on the Leeds Sports Pose (LSP) dataset. It shows promising generalization to images "in the wild" even though the regressor was trained only on Human3.6M, which was captured under controlled laboratory conditions.

Implications and Future Directions

The implications of using distance matrices for 3D pose estimation are twofold. Practically, the representation offers greater robustness in real-world settings where joint occlusions and varied viewing angles are common. Theoretically, it invites the use of EDMs in other problems whose data are naturally described by pairwise relations rather than absolute coordinates.

Future avenues for this research could explore integrating richer network architectures or alternative learning strategies to further leverage synthesized data for improving generalizability. Additionally, incorporating motion continuity across image sequences might better capture temporal dynamics, extending applicability to video sequences or real-time applications.

In sum, this paper delineates an approach that not only enhances the precision of 3D human pose estimation but also broadens the computational techniques available to tackle ambiguities inherent in monocular vision systems.

Authors (1)

Francesc Moreno-Noguer