
Putting People in their Place: Monocular Regression of 3D People in Depth (2112.08274v3)

Published 15 Dec 2021 in cs.CV

Abstract: Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth. Inferring the depth of a person in an image, however, is fundamentally ambiguous without knowing their height. This is particularly problematic when the scene contains people of very different sizes, e.g. from infants to adults. To solve this, we need several things. First, we develop a novel method to infer the poses and depth of multiple people in a single image. While previous work that estimates multiple people does so by reasoning in the image plane, our method, called BEV, adds an additional imaginary Bird's-Eye-View representation to explicitly reason about depth. BEV reasons simultaneously about body centers in the image and in depth and, by combining these, estimates 3D body position. Unlike prior work, BEV is a single-shot method that is end-to-end differentiable. Second, height varies with age, making it impossible to resolve depth without also estimating the age of people in the image. To do so, we exploit a 3D body model space that lets BEV infer shapes from infants to adults. Third, to train BEV, we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset that includes age labels and relative depth relationships between the people in the images. Extensive experiments on RH and AGORA demonstrate the effectiveness of the model and training scheme. BEV outperforms existing methods on depth reasoning, child shape estimation, and robustness to occlusion. The code and dataset are released for research purposes.

Citations (143)

Summary

  • The paper demonstrates that integrating a bird’s-eye-view representation with monocular regression significantly improves 3D human pose, shape, and depth estimation.
  • It introduces an end-to-end BEV model that leverages both front and bird’s-eye views along with an age-aware SMPL+A template for robust depth reasoning.
  • Experiments on the RH, CMU Panoptic, and AGORA datasets show state-of-the-art performance, highlighting the method's effectiveness even under severe occlusions.

Monocular Regression of 3D People in Depth

The paper "Putting People in their Place: Monocular Regression of 3D People in Depth" addresses the challenges of estimating the pose, shape, and relative depth of multiple individuals from a single RGB image. The authors propose a novel method, referred to as BEV, which stands for Bird's-Eye-View. Unlike previous approaches that predominantly reason within the image plane, BEV incorporates an imaginary bird's-eye-view representation, effectively enhancing the network's ability to simultaneously consider body centers both in the image and depth dimensions. This method marks a significant shift from previous works, enabling a comprehensive estimation of 3D body positions.

Methodology

BEV is a one-stage, end-to-end differentiable approach, in contrast to multi-stage frameworks. It directly estimates 3D human body attributes by integrating depth reasoning into the model. A core contribution is its handling of individual height, which is crucial for accurate depth inference. The model leverages an age-aware 3D body template, SMPL+A, that accommodates shape variation from infants to adults. The architecture also introduces a 3D representation built from both front-view and bird's-eye-view maps, enabling precise depth estimation even under challenging conditions such as severe occlusion.
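To illustrate the age-aware template, the hedged sketch below blends an adult template mesh (SMPL) with an infant template (e.g. SMIL, which shares SMPL's mesh topology) using a single age offset. The function `smpl_plus_a_shape` and the linear blend are assumptions for illustration; the paper's SMPL+A parameterization may differ in detail.

```python
import torch

def smpl_plus_a_shape(adult_verts, infant_verts, age_offset):
    """Blend adult and infant template vertices by an age offset.

    adult_verts, infant_verts: (V, 3) vertices in correspondence.
    age_offset: 0.0 yields the adult template, 1.0 the infant one.
    """
    a = max(0.0, min(1.0, float(age_offset)))  # clamp to [0, 1]
    return (1.0 - a) * adult_verts + a * infant_verts

adult = torch.randn(6890, 3)   # SMPL meshes have 6890 vertices
infant = torch.randn(6890, 3)  # assumed vertex-wise correspondence
child = smpl_plus_a_shape(adult, infant, age_offset=0.6)
```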

To train BEV, the authors introduce the Relative Human (RH) dataset. It is unusual in relying on weak annotations, namely depth layers and age groups, which sidestep the difficulty of obtaining ground-truth height and depth data across diverse ages. These annotations let the authors formulate a piece-wise depth-layer loss and an ambiguity-compatible age loss, both pivotal for improving generalization and managing the depth/height ambiguity.
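As a rough illustration of what weak ordinal supervision on depth layers could look like, the sketch below penalizes pairs of people whose predicted depth order contradicts their annotated layers and pulls same-layer predictions together. The function, margin, and normalization are hypothetical and not the paper's exact piece-wise formulation.

```python
import torch

def depth_layer_loss(pred_depth, layer_ids, margin=0.1):
    """Ordinal loss over annotated depth layers (0 = closest).

    pred_depth: (N,) predicted depths for the N people in an image.
    layer_ids:  (N,) integer depth-layer labels.
    """
    di = pred_depth.unsqueeze(0)   # depth of person b at cell [a, b]
    dj = pred_depth.unsqueeze(1)   # depth of person a at cell [a, b]
    li = layer_ids.unsqueeze(0)
    lj = layer_ids.unsqueeze(1)

    # Person a labeled closer than person b: its predicted depth
    # should be smaller by at least `margin`.
    closer = (lj < li).float()
    rank_term = closer * torch.relu(dj - di + margin)

    # Same-layer pairs: encourage similar predicted depths.
    same = (lj == li).float()
    same.fill_diagonal_(0.0)
    equal_term = same * (dj - di).abs()

    n_pairs = closer.sum() + same.sum()
    return (rank_term + equal_term).sum() / n_pairs.clamp(min=1.0)

pred = torch.tensor([2.0, 5.0, 2.3], requires_grad=True)
layers = torch.tensor([0, 1, 0])
depth_layer_loss(pred, layers).backward()
```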

Results and Implications

The BEV model shows substantial improvements over existing models on several benchmarks. On the RH dataset, BEV demonstrates superior accuracy in both depth reasoning and pose estimation compared to methods such as CRMH and ROMP. On the CMU Panoptic and AGORA datasets, BEV achieves state-of-the-art 3D pose estimation and mesh reconstruction accuracy. These results underline the importance of a robust 3D representation and tailored loss functions for advancing human-centric depth reasoning.

Practically, this research contributes to enhancing applications that require accurate human modeling and interaction analysis in images, ranging from augmented reality to sophisticated surveillance systems. Theoretically, it also underscores the value of integrating different viewpoints and leveraging weak annotations to improve model robustness.

Future Directions

The findings suggest multiple avenues for future research. One potential development involves extending BEV to handle even larger crowds and more diverse demographic attributes beyond age and height, such as weight and gender. Moreover, exploring semi-supervised or unsupervised learning paradigms could further optimize the model’s efficacy without extensive labeled data. Lastly, the integration of BEV with temporal models might yield advances in video-based human modeling, allowing for dynamic interaction reasoning.

In conclusion, BEV's novel approach to depth reasoning in monocular images represents a significant contribution to monocular 3D human pose estimation, laying a foundation for future exploration and practical implementation in AI applications.
