Learning from Synthetic Humans: A Comprehensive Review
The paper "Learning from Synthetic Humans" addresses human pose, shape, and motion estimation from images, a problem at the intersection of computer vision and machine learning with numerous applications. Traditional methods rely on large-scale, manually labeled datasets to train Convolutional Neural Networks (CNNs), yet acquiring such annotations, especially in 3D, is cumbersome and often impractical. The authors propose a compelling alternative in SURREAL (Synthetic hUmans foR REAL tasks), a large-scale dataset of synthetically generated yet realistic images derived from 3D motion capture sequences. SURREAL demonstrates that synthetic data can efficiently substitute for, or complement, real-world annotated data, enabling advances in depth estimation and human part segmentation.
Overview of SURREAL Dataset and Methodology
Central to this research is the SURREAL dataset, consisting of over 6 million frames, each paired with ground truth including body poses, depth maps, and segmentation masks. The dataset achieves its realism through the SMPL body model, with parameters obtained by fitting motion capture data using MoSh, ensuring accurate kinematics and shape deformations. Key aspects of the data generation pipeline are:
- Body Shape and Pose: Body shapes are sampled from the SMPL shape space, whose principal components were learned from the CAESAR dataset, while poses are animated from CMU MoCap sequences (a minimal sampling sketch follows this list).
- Texturing and Lighting: Two sets of textures, from CAESAR scans and from clothed 3D scans, provide appearance variation, with spherical harmonics supplying realistic lighting.
- Backgrounds: Images of diverse indoor scenes from the LSUN dataset serve as realistic, human-free backgrounds.
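To make the generation recipe concrete, the sketch below samples one scene configuration in the spirit of the pipeline above. It is a minimal Python/NumPy illustration: the sampling distributions, the ambient-light bias, and the index ranges are assumptions made for readability rather than values taken verbatim from the paper, and the pose is a random placeholder where SURREAL would use a MoSh-fitted CMU MoCap frame.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scene_params(n_shape=10, n_joints=24):
    """Sample one hypothetical SURREAL-style scene configuration."""
    # SMPL shape coefficients; SURREAL draws these in the PCA space
    # learned from CAESAR (a standard normal is a stand-in here).
    betas = rng.standard_normal(n_shape)
    # Placeholder axis-angle pose; the real pipeline uses MoSh-fitted
    # CMU MoCap frames rather than random joint rotations.
    pose = 0.2 * rng.standard_normal(3 * n_joints)
    # Spherical-harmonics lighting: 9 coefficients, with a bias on the
    # first (ambient) term so scenes are never completely dark.
    sh_coeffs = rng.uniform(-0.7, 0.7, size=9)
    sh_coeffs[0] += 0.7
    # Random texture (CAESAR or clothed-scan set) and LSUN background.
    texture_id = int(rng.integers(0, 1000))
    background_id = int(rng.integers(0, 400_000))
    return dict(betas=betas, pose=pose, sh=sh_coeffs,
                texture=texture_id, background=background_id)
```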
The dataset's richness is further enhanced by varying viewpoints and camera parameters, giving broad coverage of plausible real-world imaging conditions.
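Each sampled camera must be matched by consistent per-frame ground truth, which is obtained by projecting the 3D body into the image (SURREAL renders with Blender, which handles this internally). The pinhole sketch below illustrates the geometry; the focal length and principal point are hypothetical values for a 320x240 frame, not the paper's renderer settings.

```python
import numpy as np

def project_joints(joints_3d, f=600.0, cx=160.0, cy=120.0):
    """Project 3D joints (N, 3), given in camera coordinates, to pixels.

    Simple pinhole model: u = f * X / Z + cx, v = f * Y / Z + cy.
    The intrinsics here are hypothetical, not SURREAL's actual settings.
    """
    x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

# Example: a joint 3 m in front of the camera and 0.5 m to the right
# lands right of the image center.
print(project_joints(np.array([[0.5, 0.0, 3.0]])))  # -> [[260., 120.]]
```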
Training and Results
The experiments address human part segmentation and depth estimation using adaptations of the stacked hourglass network architecture, originally designed for pose estimation, with the output layer repurposed for per-pixel classification (a minimal sketch of this adaptation follows the results below). The networks were trained on the synthetic dataset and evaluated on several benchmarks:
- Synthetic Data Performance: On the synthetic test set, the segmentation network reached 69.13% IOU and 80.61% pixel accuracy. Depth estimation yielded RMSE and st-RMSE values of 72.9mm and 56.3mm, respectively.
- Real-World Evaluation on FSitting: Pre-training on synthetic data proved beneficial, and fine-tuning on the small Freiburg Sitting People dataset further improved performance, showing that the pre-trained model generalizes effectively to real images.
- Comprehensive Analysis on Human3.6M: On the Human3.6M dataset, the fine-tuned CNN outperformed a baseline trained solely on real data; the resulting segmentation IOU (54.30%) and depth RMSE (90.0mm) validate the robustness conferred by synthetic pre-training.
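For concreteness, the sketch below shows how a per-pixel classification head of the kind the paper describes can be attached to backbone features: segmentation predicts body-part classes plus background, and depth is likewise cast as classification over quantized depth bins plus background. The backbone, channel width, and the choice to show both heads in one module are illustrative assumptions; the class counts (14 parts plus background, 19 depth bins plus background) follow our reading of the paper, and this is not the authors' exact implementation, which builds on the stacked hourglass network.

```python
import torch
import torch.nn as nn

class PixelwiseHeads(nn.Module):
    """Per-pixel classification heads over backbone features.

    Hypothetical stand-in for the paper's adapted hourglass output:
    both tasks are treated as per-pixel classification and trained
    with cross-entropy.
    """
    def __init__(self, feat_ch=256, n_seg=15, n_depth=20):
        super().__init__()
        self.seg = nn.Conv2d(feat_ch, n_seg, kernel_size=1)      # 14 parts + bg
        self.depth = nn.Conv2d(feat_ch, n_depth, kernel_size=1)  # 19 bins + bg

    def forward(self, feats):
        return self.seg(feats), self.depth(feats)

# Toy forward/loss pass with random stand-ins for features and labels.
heads = PixelwiseHeads()
criterion = nn.CrossEntropyLoss()
feats = torch.randn(2, 256, 64, 64)                 # backbone feature map
seg_logits, depth_logits = heads(feats)
seg_target = torch.randint(0, 15, (2, 64, 64))      # per-pixel part labels
depth_target = torch.randint(0, 20, (2, 64, 64))    # per-pixel depth bins
loss = criterion(seg_logits, seg_target) + criterion(depth_logits, depth_target)
```

Fine-tuning on a small real dataset, as in the FSitting and Human3.6M experiments, then amounts to continuing training of the same weights on real labels.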
Implications and Future Directions
The research demonstrates the substantial potential of synthetic datasets for training deep learning models on vision tasks that traditionally depend on large annotated datasets. Practical implications include faster and cheaper development of pose estimation systems for augmented reality, robotics, and surveillance.
Theoretically, the results advance synthetic-to-real domain adaptation, opening avenues for further exploration of synthetic data as a complement to real-world learning. Future work could integrate more intricate elements such as dynamic lighting, interactive backgrounds, multiple human figures, and complex occlusions to push synthetic realism even further.
Conclusion
"SURREAL" offers a pivotal contribution to computational human analysis by demonstrating that synthetic data can be a viable surrogate to real-world annotations, significantly reducing the dependency on manually-labeled datasets. This work lays the foundation for future advancements in human analysis, facilitating research and development in domains requiring rich understanding of human poses and movements without the bottleneck of extensive manual data annotation.