
Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image (1907.11346v2)

Published 26 Jul 2019 in cs.CV

Abstract: Although significant improvement has been achieved recently in 3D human pose estimation, most of the previous methods only treat a single-person case. In this work, we firstly propose a fully learning-based, camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. The pipeline of the proposed system consists of human detection, absolute 3D human root localization, and root-relative 3D single-person pose estimation modules. Our system achieves comparable results with the state-of-the-art 3D single-person pose estimation models without any groundtruth information and significantly outperforms previous 3D multi-person pose estimation methods on publicly available datasets. The code is available in https://github.com/mks0601/3DMPPE_ROOTNET_RELEASE , https://github.com/mks0601/3DMPPE_POSENET_RELEASE.

Authors (3)
  1. Gyeongsik Moon (31 papers)
  2. Ju Yong Chang (14 papers)
  3. Kyoung Mu Lee (107 papers)
Citations (304)

Summary

  • The paper’s main contribution is a fully learning-based framework that localizes 3D human roots and estimates multi-person poses from a single RGB image.
  • The methodology integrates DetectNet for human detection, RootNet for absolute depth estimation, and PoseNet for refined root-relative pose predictions.
  • The approach achieves superior performance on benchmarks like MuPoTS-3D and competitive results on Human3.6M without using groundtruth data at test time.

Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image

Overview

The paper presents a fully learning-based, camera distance-aware top-down approach to 3D multi-person pose estimation from a single RGB image. It addresses a limitation of prior work, which predominantly targets the single-person case, by introducing a framework that handles multiple people without relying on any groundtruth information during inference.

Methodology

The proposed system consists of three primary components:

  1. DetectNet: A human detection network based on Mask R-CNN, responsible for detecting human bounding boxes in the input image.
  2. RootNet: A novel 3D human root localization network that estimates the absolute depth and camera-centered coordinates of each person's root joint. RootNet improves the reliability of its depth estimate by predicting a correction factor that refines the bounding-box area used in the pinhole-camera depth relation, compensating for boxes that are unusually tight or loose.
  3. PoseNet: A root-relative 3D single-person pose estimation network adapted from Sun et al.’s model, which outputs root-relative 3D poses from the cropped human images provided by DetectNet.
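RootNet's depth estimate is built around a distance measure that relates the known approximate physical extent of a human to the area of the detected bounding box. The sketch below illustrates this relation under the pinhole-camera model; the function name and the default 2000mm x 2000mm real-world extent follow the paper's formulation, but treat the exact parameterization as an assumption of this sketch rather than a faithful reimplementation (in the full system, RootNet's predicted correction factor adjusts the box area before this formula is applied).

```python
import math

def approx_root_depth(fx, fy, bbox_w_px, bbox_h_px,
                      real_w_mm=2000.0, real_h_mm=2000.0):
    """Distance measure k: approximate absolute root depth in mm.

    Derived from the pinhole-camera relation between the assumed
    real-world extent of a person (~2000mm x 2000mm) and the area of
    the detected bounding box in pixels, given focal lengths fx, fy.
    """
    img_area = bbox_w_px * bbox_h_px
    real_area = real_w_mm * real_h_mm
    return math.sqrt(fx * fy * real_area / img_area)

# A 200x400 px box under focal lengths of 1500 px implies a person
# roughly 10.6 m from the camera; halving the box area moves the
# estimate further away, as expected.
depth_mm = approx_root_depth(fx=1500, fy=1500, bbox_w_px=200, bbox_h_px=400)
```

Because this measure depends only on box size and camera intrinsics, it fails when a box is cropped (e.g. a seated or occluded person); the learned correction factor exists precisely to handle such cases.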

Results

The system outperforms previous 3D multi-person pose estimation methods on publicly available datasets, including MuPoTS-3D, by directly estimating absolute camera-centered 3D human keypoints rather than only root-relative poses. On the Human3.6M dataset it achieves results competitive with state-of-the-art single-person models, notably without using any groundtruth information (such as true root depth or bounding boxes) at test time.
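The final absolute keypoints are obtained by combining RootNet's absolute root depth with PoseNet's root-relative predictions and back-projecting pixel coordinates into camera space. A minimal sketch of the back-projection step, assuming the standard pinhole model with intrinsics (fx, fy, cx, cy); the function name is illustrative:

```python
def back_project(u, v, z_mm, fx, fy, cx, cy):
    """Back-project a pixel keypoint (u, v) with absolute depth z_mm
    into camera-centered 3D coordinates (mm) via the pinhole model."""
    x = (u - cx) / fx * z_mm
    y = (v - cy) / fy * z_mm
    return (x, y, z_mm)

# A keypoint at the principal point maps onto the optical axis:
# back_project(960, 540, 5000, 1500, 1500, 960, 540) -> (0.0, 0.0, 5000)
```

In the full pipeline, each joint's depth is the RootNet root depth plus PoseNet's root-relative depth offset for that joint, and this back-projection yields the camera-centered 3D pose evaluated on MuPoTS-3D.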

Implications

This research contributes a flexible and robust framework enabling the seamless integration of existing human detection and 3D pose estimation models. The effectiveness of RootNet in capturing accurate 3D root positions can be leveraged across various 3D computer vision applications beyond pose estimation, potentially extending to areas like 3D human mesh reconstruction.

Future Directions

Future research could enhance the accuracy of 3D human root localization by integrating insights from single-image depth estimation techniques. Additionally, adapting this framework for dynamic scenes with occlusions and variable lighting conditions could significantly broaden its applicability.

Overall, this work lays the groundwork for more extensive exploration into 3D multi-person analysis from monocular sources, addressing a crucial gap in the field's current methodologies.