DensePose: Dense Human Pose Estimation In The Wild (1802.00434v1)

Published 1 Feb 2018 in cs.CV

Abstract: In this work, we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We first gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an 'inpainting' network that can fill in missing groundtruth values and report clear improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter; we further improve accuracy through cascading, obtaining a system that delivers highly-accurate results in real time. Supplementary materials and videos are provided on the project page http://densepose.org

Citations (1,300)

Summary

- The paper introduces the DensePose-COCO dataset, whose more than 5 million annotated correspondences expand the training data available for dense human pose estimation.
- The DensePose-RCNN algorithm combines FCNs with Mask-RCNN principles to map the pixels of a 2D image onto a 3D surface model of the human body.
- Cascading and an inpainting-based training signal improve accuracy, yielding a system that runs in real time and stays robust in challenging real-world scenes.

DensePose: Dense Human Pose Estimation In The Wild

"DensePose: Dense Human Pose Estimation In The Wild," authored by Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos, addresses the challenging task of establishing dense correspondences between an RGB image and a 3D surface-based representation of the human body. The paper contributes a dataset, an algorithm design, and performance improvements, targeting real-world settings characterized by background clutter, occlusions, and scale variation.

Key Contributions

Dataset Creation: DensePose-COCO
- The authors introduce DensePose-COCO, a manually annotated dataset covering 50,000 person instances from the COCO dataset. A novel two-stage annotation process enabled the collection of over 5 million manually annotated correspondences, substantially expanding the training data available for dense human pose estimation.
- In the first stage, annotators identify body parts; in the second, they map points sampled on each part to the corresponding points on a 3D surface representation, specifically the SMPL model.

Algorithmic Advances: DensePose-RCNN
- The DensePose-RCNN algorithm integrates fully convolutional networks (FCNs) and region-based convolutional neural network (R-CNN) approaches to achieve superior dense correspondence mapping.
- Notably, the algorithm introduces a region-based processing approach that employs Mask-RCNN's principles. This design choice enhances the system’s ability to handle complex scenes involving multiple individuals.

Performance Optimization: Cascading and Inpainting
- To further refine the prediction accuracy, the authors utilize a cascading model that incorporates keypoint detection and instance segmentation tasks, leveraging their synergies.
- The introduction of a distillation-based inpainting technique allows the sparse ground-truth annotations to be expanded into a denser and more effective training signal.
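
The need for inpainting follows from how sparse the manual supervision is: only the annotated points carry a surface label, so a regression loss can be evaluated only there. A minimal Python sketch of that situation, assuming a hypothetical record layout (SurfacePoint, with a part index and per-part U, V coordinates in [0, 1]) and a plain mean-absolute-error loss; neither the names nor the loss are taken from the paper:

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SurfacePoint:
    """One annotated image-to-surface correspondence (hypothetical layout)."""
    x: int      # pixel column
    y: int      # pixel row
    part: int   # body-part index
    u: float    # surface coordinate within the part, assumed in [0, 1]
    v: float    # surface coordinate within the part, assumed in [0, 1]


def sparse_uv_loss(pred_u: Sequence[Sequence[float]],
                   pred_v: Sequence[Sequence[float]],
                   points: List[SurfacePoint]) -> float:
    """Mean absolute U/V error, evaluated only at annotated pixels."""
    if not points:
        return 0.0
    total = 0.0
    for p in points:
        total += abs(pred_u[p.y][p.x] - p.u) + abs(pred_v[p.y][p.x] - p.v)
    # Two regressed values (U and V) per annotated point.
    return total / (2 * len(points))


def supervision_density(points: List[SurfacePoint],
                        height: int, width: int) -> float:
    """Fraction of pixels in a height x width region that carry a label."""
    labelled = {(p.x, p.y) for p in points}
    return len(labelled) / (height * width)


# Toy 4x4 region with two annotated points and an all-zero prediction.
points = [SurfacePoint(0, 0, 1, 0.5, 0.5), SurfacePoint(2, 1, 3, 1.0, 0.0)]
zeros = [[0.0] * 4 for _ in range(4)]
print(sparse_uv_loss(zeros, zeros, points))
print(supervision_density(points, 4, 4))
```

In this toy region only 2 of 16 pixels contribute to the loss. An inpainting network that predicts surface values at the remaining pixels would let a loss of the same form be evaluated everywhere, which is the denser training signal described above.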

Evaluation and Results

Dataset Utility

The DensePose-COCO dataset was compared directly with surrogate datasets such as SURREAL and Unite the People (UP), which comprise synthetically generated and semi-automatically annotated data, respectively. Models trained on DensePose-COCO significantly outperformed those trained on these alternative sources, highlighting the importance of high-quality, densely annotated data for dense pose estimation accuracy.

Model Performance

DensePose-RCNN achieved marked improvements in precision and accuracy over previous models. This system demonstrated:

- Processing speeds of 20-26 frames per second on 240×320 images and 4-5 frames per second on 800×1100 images, measured on a GTX 1080 GPU.
- Accurate dense correspondences even under occlusions and scale variations.
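
To put the reported throughput in per-frame terms, the frame rates can be converted to latencies. A small Python sketch; the helper name frame_time_ms is illustrative, and only the frame rates and resolutions come from the figures above:

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frame rate (frames per second) to per-frame latency in ms."""
    return 1000.0 / fps


# Frame-rate ranges and image resolutions as reported for a GTX 1080.
reported = {
    (240, 320): (20, 26),
    (800, 1100): (4, 5),
}

for (h, w), (fps_low, fps_high) in reported.items():
    # Higher fps means lower latency, so the bounds swap.
    print(f"{h}x{w}: {frame_time_ms(fps_high):.1f}-"
          f"{frame_time_ms(fps_low):.1f} ms per frame")

# How many times more pixels the larger image contains.
pixel_ratio = (800 * 1100) / (240 * 320)
print(f"pixel ratio: {pixel_ratio:.1f}x")
```

This works out to roughly 38 to 50 ms per frame at the lower resolution and 200 to 250 ms at the higher one. The larger image has about 11.5 times as many pixels, while the frame rate drops by a factor of about 4 to 6.5.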

In addition to quantitative metrics, qualitative evaluations further emphasized the robustness of DensePose-RCNN. The system maintained high fidelity in UV mapping across diverse body poses and occlusions, showcasing its potential utility in real-world applications.

Theoretical and Practical Implications

The implications of this research are both theoretical and practical:

- Theoretical: The paper establishes a new benchmark for dense human pose estimation, moving beyond landmark-based methods. It points to possible extensions to other object categories, with the eventual aim of comprehensive 3D understanding of general objects.
- Practical: Real-time processing opens the way to applications in augmented reality, computer graphics, and human-computer interaction, where accurate dense pose mapping allows virtual elements to be integrated with real-world imagery.

Future Directions

Future research can build upon this work by exploring domain adaptation techniques to better leverage synthetic datasets, enhancing generalization across varied real-world scenarios. Additionally, further refinement of inpainting methods and the exploration of more intricate multi-task learning frameworks could yield incremental performance gains.

DensePose has laid the groundwork for subsequent studies focused on dense pose estimation and its applications. By establishing a powerful dataset and robust algorithms, this paper provides a foundational resource for future research aimed at bridging the gap between 2D image data and 3D semantic understanding.
