- The paper’s main contribution is a fully learning-based framework that localizes 3D human roots and estimates multi-person 3D poses from a single RGB image.
- The methodology integrates DetectNet for human detection, RootNet for absolute root depth estimation, and PoseNet for root-relative 3D pose prediction.
- The approach outperforms prior 3D multi-person methods on benchmarks such as MuPoTS-3D and achieves competitive results on Human3.6M without using ground-truth data at test time.
Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image
Overview
The paper presents a novel approach to 3D multi-person pose estimation from a single RGB image, using a fully learning-based, camera distance-aware top-down methodology. It addresses a limitation of previous work, which predominantly focuses on 3D single-person pose estimation, by introducing a comprehensive framework that handles multiple people without relying on ground-truth information at inference time.
Methodology
The proposed system consists of three primary components:
- DetectNet: A human detection network based on Mask R-CNN, responsible for detecting human bounding boxes in the input image (a minimal stand-in sketch follows this list).
- RootNet: A novel 3D human root localization network that estimates the camera-centered coordinates of each person’s root joint, including its absolute depth. Rather than regressing depth directly, RootNet predicts a correction factor that rectifies the bounding-box area used in a pinhole-camera distance estimate, which makes the depth prediction robust to boxes that are larger or smaller than the person’s true extent (see the depth sketch after this list).
- PoseNet: A root-relative 3D single-person pose estimation network adapted from Sun et al.’s integral-regression model, which outputs root-relative 3D poses from the cropped human images provided by DetectNet (a soft-argmax sketch follows this list).
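The three sketches below are minimal, hedged illustrations of each component, not the paper’s implementation; all file names, thresholds, and variable names are ours. First, DetectNet’s role (person bounding boxes from an RGB image) can be approximated with torchvision’s pretrained Mask R-CNN, assuming a recent torchvision:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# A pretrained Mask R-CNN playing DetectNet's role: given an RGB image,
# return candidate human bounding boxes. COCO class label 1 is "person".
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("scene.jpg").convert("RGB"))
with torch.no_grad():
    outputs = model([image])[0]

# Keep confident person detections; 0.5 is an illustrative threshold,
# not the paper's setting.
keep = (outputs["labels"] == 1) & (outputs["scores"] > 0.5)
person_boxes = outputs["boxes"][keep]  # (N, 4): x1, y1, x2, y2 in pixels
```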
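RootNet’s depth estimate rests on pinhole-camera geometry: if a person occupies a roughly constant real-world area (the paper assumes about 2 m × 2 m), the camera distance k can be recovered from the ratio of that area to the bounding-box area in pixels, and the learned correction factor rescales the box area when the box is a poor proxy for the person’s true extent. The sketch below folds the correction factor directly into the area; the paper’s exact parameterization may differ:

```python
import math

def estimate_root_depth(bbox_area_px, gamma, fx, fy,
                        real_area_mm2=2000.0 * 2000.0):
    """Pinhole-camera distance estimate underlying RootNet's depth.

    bbox_area_px : area of the detected human bounding box in pixels
    gamma        : learned correction factor rescaling the box area
                   (gamma = 1.0 reproduces the uncorrected estimate)
    fx, fy       : camera focal lengths in pixels
    real_area_mm2: assumed real-world area of a person (~2 m x 2 m)
    """
    corrected_area = gamma * bbox_area_px
    # An object of area A_real at distance k projects to roughly
    # fx * fy * A_real / k^2 pixels, so k = sqrt(fx * fy * A_real / A_img).
    return math.sqrt(fx * fy * real_area_mm2 / corrected_area)

# Example: a 600 x 1500 px box with typical intrinsics gives ~3.2 m.
k = estimate_root_depth(bbox_area_px=600 * 1500, gamma=1.0,
                        fx=1500.0, fy=1500.0)
print(f"estimated camera distance: {k:.0f} mm")
```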
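PoseNet builds on Sun et al.’s integral regression, whose key operation is a differentiable soft-argmax that converts per-joint volumetric heatmaps into continuous 3D coordinates. A minimal PyTorch version (shapes and names are ours):

```python
import torch

def soft_argmax_3d(heatmaps):
    """Differentiable argmax over per-joint 3D heatmaps.

    heatmaps: (J, D, H, W) tensor, one volumetric heatmap per joint.
    Returns (J, 3) root-relative (x, y, z) coordinates in voxel units.
    """
    J, D, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(J, -1), dim=1).reshape(J, D, H, W)
    # Expected coordinate along each axis = sum of probability * index.
    zs = torch.arange(D, dtype=probs.dtype)
    ys = torch.arange(H, dtype=probs.dtype)
    xs = torch.arange(W, dtype=probs.dtype)
    z = (probs.sum(dim=(2, 3)) * zs).sum(dim=1)
    y = (probs.sum(dim=(1, 3)) * ys).sum(dim=1)
    x = (probs.sum(dim=(1, 2)) * xs).sum(dim=1)
    return torch.stack([x, y, z], dim=1)
```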
Results
The system outperforms previous 3D multi-person pose estimation methods on publicly available datasets such as MuPoTS-3D by directly estimating absolute camera-centered 3D human keypoints. On Human3.6M, it achieves accuracy competitive with state-of-the-art single-person models, even though no ground-truth information is used at test time.
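Putting the outputs together: the root’s pixel location and RootNet’s absolute depth are back-projected into camera space with the intrinsics, and PoseNet’s root-relative skeleton is shifted to that position. A minimal sketch of this composition (function and variable names are illustrative):

```python
import numpy as np

def absolute_pose(root_xy_px, root_depth_mm, rel_pose_mm, fx, fy, cx, cy):
    """Compose camera-centered 3D keypoints from the two network outputs.

    root_xy_px    : (2,) root joint location in image pixels
    root_depth_mm : scalar absolute depth from RootNet
    rel_pose_mm   : (J, 3) root-relative 3D joints from PoseNet
    fx, fy, cx, cy: pinhole camera intrinsics in pixels
    """
    # Back-project the root pixel into camera coordinates.
    x = (root_xy_px[0] - cx) / fx * root_depth_mm
    y = (root_xy_px[1] - cy) / fy * root_depth_mm
    root_cam = np.array([x, y, root_depth_mm])
    # Shift the root-relative skeleton to its absolute position.
    return rel_pose_mm + root_cam
```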
Implications
This research contributes a flexible, modular framework in which existing human detection and 3D pose estimation models can be swapped in with little modification. RootNet’s ability to recover accurate 3D root positions can also be reused in 3D computer vision applications beyond pose estimation, potentially extending to areas like 3D human mesh reconstruction.
Future Directions
Future research could enhance the accuracy of 3D human root localization by integrating insights from single-image depth estimation techniques. Additionally, adapting this framework for dynamic scenes with occlusions and variable lighting conditions could significantly broaden its applicability.
Overall, this work lays the groundwork for more extensive exploration into 3D multi-person analysis from monocular sources, addressing a crucial gap in the field's current methodologies.