- The paper introduces a novel method that decouples 3D pose estimation into reliable 2D keypoint detection and an efficient 3D matching process.
- The approach leverages deep neural networks for 2D estimation and a nearest-neighbor matching algorithm to achieve state-of-the-art results on the Human3.6M benchmark.
- The method delivers low inference times and robust generalization, offering promising applications in action recognition and human-robot interaction.
3D Human Pose Estimation = 2D Pose Estimation + Matching
Ching-Hang Chen and Deva Ramanan of Carnegie Mellon University propose a method for 3D human pose estimation from a single RGB image: an intermediate 2D pose estimation step followed by matching against a library of 3D poses. The approach is predicated on two observations: deep networks now produce accurate 2D pose estimates, and substantial 3D motion capture datasets are available.
Introduction and Motivation
3D human pose estimation from images is a long-standing problem, with applications spanning action recognition and human-robot interaction. Traditional approaches often require depth sensors or multi-camera rigs; the authors focus on the more challenging yet practical scenario of inferring 3D pose from a single 2D image.
Methodology
The proposed method comprises a two-step process:
- 2D Pose Estimation: Using state-of-the-art deep architectures such as convolutional pose machines (CPMs), the system first predicts 2D keypoints. These detectors have improved markedly at handling occlusion and appearance variation, making them a reliable front end.
- 3D Pose Matching: The system then lifts the 2D pose with a non-parametric, nearest-neighbor lookup against a predefined library of 3D poses. Each library pose is projected into many virtual 2D views; the projection closest to the detected 2D pose is retrieved, and that exemplar's depth values are used to lift the detection to 3D (see the sketch after this list).
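To make the matching concrete, here is a minimal sketch of the projection-and-lookup stage in NumPy. It assumes a simple weak-perspective camera, a library of 3D mocap poses, and brute-force nearest-neighbor search; the function names and array shapes are illustrative, not the authors' code, and in practice the 2D poses would also be normalized for scale and translation before matching.

```python
import numpy as np

def project_weak_perspective(pose_3d, R):
    """Project a (J, 3) pose into 2D: rotate into a virtual camera
    frame, then drop the depth coordinate (weak perspective)."""
    return (pose_3d @ R.T)[:, :2]

def build_2d_library(poses_3d, rotations):
    """Render every 3D exemplar under every virtual camera rotation,
    keeping each projection together with its source pose and view."""
    return [(project_weak_perspective(p, R), p, R)
            for p in poses_3d for R in rotations]

def match_and_lift(pose_2d, library):
    """Nearest-neighbor lookup: return the exemplar (rotated into the
    matched view) whose virtual 2D projection is closest to the
    detected (J, 2) keypoints."""
    proj, exemplar, R = min(
        library, key=lambda e: np.linalg.norm(e[0] - pose_2d))
    return exemplar @ R.T  # 3D pose in the matched camera frame
```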
Evaluation and Results
The approach is evaluated on Human3.6M, a widely used benchmark dataset. The authors report evaluations under several established protocols to ensure comparability with prior work. The method shows notable improvements over previously published techniques, in several cases surpassing systems built on more complex 3D reasoning.
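For reference, Human3.6M results are typically reported as mean per joint position error (MPJPE) in millimeters, with some protocols first aligning the prediction to the ground truth; these protocol details are an assumption of this note, not a claim from the paper. A minimal sketch of the metric:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance in mm
    between predicted and ground-truth 3D joints, both shaped (J, 3)."""
    return np.linalg.norm(pred - gt, axis=1).mean()
```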
Specifically, matching exemplars without any warping is already competitive, while a simple warping step (sketched below) further reduces error. The full pipeline runs in under 200 ms per image and achieves state-of-the-art results on Human3.6M.
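The warping step, as described, can be read as keeping the matched exemplar's depths while replacing its image-plane coordinates with the detected 2D keypoints. Below is a hedged sketch under a pinhole camera with known focal length f; the focal length value and the per-joint back-projection are assumptions of this illustration, not the paper's exact formulation.

```python
import numpy as np

def warp_exemplar(pose_2d, exemplar_3d, f=1000.0):
    """Back-project each detected keypoint (u, v) to the matched
    exemplar's depth z under a pinhole camera with focal length f
    (illustrative value), keeping z itself from the exemplar."""
    z = exemplar_3d[:, 2:3]                 # (J, 1) depths from the match
    xy = pose_2d * z / f                    # (J, 2) back-projected x, y
    return np.concatenate([xy, z], axis=1)  # warped (J, 3) 3D pose
```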
Generalization and Implications
One crucial advantage highlighted is generalization across diverse datasets: because the 3D stage operates on 2D intermediate representations, the 2D detector can be trained on large, varied image collections. The design is also modular; since the matching stage is non-parametric, future advances in 2D pose estimation can be swapped in without retraining the 3D stage.
Future Directions
The paper hints at potential improvements through more extensive 3D libraries and enhanced 2D pose estimation techniques. The authors also suggest exploring this methodology’s applicability in more dynamic or uncontrolled environments.
Conclusion
Chen and Ramanan present a method that effectively decouples 3D pose estimation into more manageable sub-problems, building on mature 2D pose estimation technology. Their findings underscore the pivotal role of robust 2D pose estimates in reliable 3D reconstruction and point to a promising direction for future research and applications in computer vision.