Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes (2104.07300v3)

Published 15 Apr 2021 in cs.CV

Abstract: We consider the problem of recovering a single person's 3D human mesh from in-the-wild crowded scenes. While much progress has been made in 3D human mesh estimation, existing methods struggle when the test input contains crowded scenes. The first reason for the failure is a domain gap between training and testing data. A motion capture dataset, which provides accurate 3D labels for training, lacks crowd data and impedes a network from learning crowded scene-robust image features of a target person. The second reason is feature processing that spatially averages the feature map of a localized bounding box containing multiple people. Averaging the whole feature map makes a target person's features indistinguishable from others'. We present 3DCrowdNet, which explicitly targets in-the-wild crowded scenes and estimates a robust 3D human mesh by addressing the above issues. First, we leverage 2D human pose estimation, which does not require a motion capture dataset with 3D labels for training and does not suffer from the domain gap. Second, we propose a joint-based regressor that distinguishes a target person's features from those of others. Our joint-based regressor preserves the spatial activation of a target by sampling features from the target's joint locations and regresses human model parameters. As a result, 3DCrowdNet learns target-focused features and effectively excludes the irrelevant features of nearby persons. We conduct experiments on various benchmarks and demonstrate the robustness of 3DCrowdNet to in-the-wild crowded scenes both quantitatively and qualitatively. The code is available at https://github.com/hongsukchoi/3DCrowdNet_RELEASE.

Citations (73)

Summary

  • The paper introduces 3DCrowdNet, a framework that addresses inter-person occlusions by integrating robust 2D pose estimators with a joint-based regressor.
  • It demonstrates significant improvements in metrics like MPJPE and PA-MPJPE on challenging datasets such as 3DPW-Crowd and MuPoTS.
  • The study highlights practical implications for surveillance, virtual reality, and human-computer interaction by accurately distinguishing overlapping individuals.

Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes

In the field of computer vision, the paper titled "Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes" addresses the significant challenge of reconstructing 3D human meshes in complex environments containing multiple individuals. Traditional methods for 3D human mesh estimation often underperform in crowded scenes due to inter-person occlusion and the domain gap between controlled datasets (such as MoCap) and in-the-wild images. This paper proposes a novel approach, 3DCrowdNet, to tackle these issues directly.

Methodology

The 3DCrowdNet framework is notable for its dual strategies: leveraging 2D pose estimations and a joint-based regressor to extract and distinguish features in crowded scenes.

  1. Utilizing 2D Pose Estimation: To address the mismatch between training and test domains, 3DCrowdNet uses 2D pose estimators that are trained on in-the-wild datasets and are therefore robust to the crowded, occluded scenes that cause the domain gap. This offers a strong starting point for localizing the target individual without requiring 3D supervision from MoCap data.
  2. Joint-based Regressor: The proposed joint-based regressor samples image features at the 2D joint locations of the target person, preserving spatial activations rather than averaging them away. This keeps the target's features distinct from those of nearby, occluding people and retains the person-specific cues needed to regress human model parameters.
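
The joint-based sampling step can be sketched in a few lines. The snippet below is a minimal NumPy illustration of bilinear feature sampling at 2D joint locations, not the authors' implementation; the function name, shapes, and the absence of batching are simplifying assumptions:

```python
import numpy as np

def sample_joint_features(feat_map, joints_xy):
    """Bilinearly sample a (C, H, W) feature map at 2D joint locations.

    feat_map:  (C, H, W) image feature map from a backbone.
    joints_xy: (J, 2) pixel coordinates (x, y) of the target person's joints.
    Returns:   (J, C) per-joint feature vectors.
    """
    C, H, W = feat_map.shape
    x = np.clip(joints_xy[:, 0], 0, W - 1)
    y = np.clip(joints_xy[:, 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Bilinear interpolation over the four neighbouring feature columns.
    f = (feat_map[:, y0, x0] * (1 - wx) * (1 - wy)
         + feat_map[:, y0, x1] * wx * (1 - wy)
         + feat_map[:, y1, x0] * (1 - wx) * wy
         + feat_map[:, y1, x1] * wx * wy)
    return f.T  # (J, C)
```

In practice this kind of sampling is done with a differentiable op (e.g. bilinear grid sampling) inside the network, so gradients flow back into the backbone; the per-joint feature vectors are then fed to the regressor that predicts human model parameters.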

Experimental Results

Quantitative evaluations demonstrate the superior performance of 3DCrowdNet over existing methods. The paper reports significant improvements on datasets such as 3DPW-Crowd, MuPoTS, and CMU Panoptic, showing robustness in challenging environments. Errors are measured with MPJPE (Mean Per-Joint Position Error) and PA-MPJPE (the same error after Procrustes alignment), both of which drop markedly relative to prior work. Compared to methods such as SPIN and ROMP, which struggle with domain gaps and occlusions, 3DCrowdNet maintains accuracy by effectively differentiating between overlapping individuals.
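
For reference, the two reported metrics can be computed as follows. This is the standard formulation of MPJPE and Procrustes-aligned MPJPE in NumPy, not code from the paper; joint sets, units (typically millimetres), and root-alignment conventions vary by benchmark:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance over (J, 3) joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment (optimal scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    Xp, Xg = pred - mu_p, gt - mu_g
    # Umeyama-style similarity alignment of pred onto gt via SVD.
    U, S, Vt = np.linalg.svd(Xp.T @ Xg)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:  # guard against reflections
        Vt[-1] *= -1
        S = S.copy()
        S[-1] *= -1
        R = (U @ Vt).T
    s = S.sum() / (Xp ** 2).sum()
    aligned = s * Xp @ R.T + mu_g
    return mpjpe(aligned, gt)
```

PA-MPJPE removes global scale, rotation, and translation before measuring error, so it isolates articulated-pose accuracy; the gap between MPJPE and PA-MPJPE indicates how much of the error comes from global misalignment rather than from the pose itself.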

Theoretical and Practical Implications

Theoretically, this research contributes a novel perspective to the domain of human mesh recovery, emphasizing the importance of utilizing robust 2D pose estimators to bridge domain gaps. The joint-based regressor introduces a refined approach to feature discrimination in cluttered environments, advancing the comprehension of deep learning's potential in pose estimation tasks.

Practically, the improved accuracy and robustness of 3DCrowdNet can significantly enhance applications in surveillance, virtual reality, and human-computer interaction, where scene complexity is a common scenario.

Future Developments

The paper hints at an intriguing area of future work involving the integration of relative translation modeling between individuals in crowded scenes. Furthermore, enhancing data augmentation techniques to strengthen networks against similar appearances or occlusions promises further advancements in real-world applications.

The presented 3DCrowdNet provides a compelling argument for reevaluating traditional approaches to 3D human mesh estimation, with its focus on domain adaptation and feature discrimination likely to inspire additional research within the community.
