PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision (2112.09290v2)

Published 17 Dec 2021 in cs.CV, cs.AI, cs.DB, cs.GR, and cs.LG

Abstract: In recent years, person detection and human pose estimation have made great strides, helped by large-scale labeled datasets. However, these datasets had no guarantees or analysis of human activities, poses, or context diversity. Additionally, privacy, legal, safety, and ethical concerns may limit the ability to collect more human data. An emerging alternative to real-world data that alleviates some of these issues is synthetic data. However, creation of synthetic data generators is incredibly challenging and prevents researchers from exploring their usefulness. Therefore, we release a human-centric synthetic data generator PeopleSansPeople which contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels. Using PeopleSansPeople, we performed benchmark synthetic data training using a Detectron2 Keypoint R-CNN variant [1]. We found that pre-training a network using synthetic data and fine-tuning on various sizes of real-world data resulted in a keypoint AP increase of $+38.03$ ($44.43 \pm 0.17$ vs. $6.40$) for few-shot transfer (limited subsets of COCO-person train [2]), and an increase of $+1.47$ ($63.47 \pm 0.19$ vs. $62.00$) for abundant real data regimes, outperforming models trained with the same real data alone. We also found that our models outperformed those pre-trained with ImageNet with a keypoint AP increase of $+22.53$ ($44.43 \pm 0.17$ vs. $21.90$) for few-shot transfer and $+1.07$ ($63.47 \pm 0.19$ vs. $62.40$) for abundant real data regimes. This freely-available data generator should enable a wide range of research into the emerging field of simulation to real transfer learning in the critical area of human-centric computer vision.

Essay on "PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision"

The paper "PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision" presents a novel tool for generating synthetic data aimed at advancing research in the computer vision domain, particularly in tasks related to human detection and pose estimation. The authors, affiliated with Unity Technologies, introduce PeopleSansPeople, a sophisticated data generator capable of simulating highly varied human-centric datasets using Unity's rendering capabilities and Perception package.

Contributions and Methodology

The central contribution of this work is the development and release of PeopleSansPeople, a system designed to address several limitations of current human-centric datasets, such as privacy concerns, the complexity of annotation, and a lack of guaranteed diversity in poses and activities. It uses computer graphics to create large-scale, diverse datasets with rich annotations, including 2D and 3D bounding boxes, instance and semantic segmentation, and keypoints conforming to the COCO standard.
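To make the "keypoints conforming to the COCO standard" label concrete, here is a minimal sketch of a single COCO-style person annotation. The 17 keypoint names and the flattened `[x, y, v]` layout follow the public COCO keypoint format; the helper `make_person_annotation` and the example values are hypothetical and are not taken from the PeopleSansPeople code.

```python
# The 17 keypoints of the COCO person category, in their standard order.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def make_person_annotation(ann_id, image_id, bbox_xywh, keypoints_xyv):
    """Build one COCO-style person annotation (hypothetical helper).

    keypoints_xyv is a list of 17 (x, y, v) triples, where v is the COCO
    visibility flag: 0 = not labeled, 1 = labeled but not visible,
    2 = labeled and visible.
    """
    if len(keypoints_xyv) != len(COCO_KEYPOINT_NAMES):
        raise ValueError("expected 17 keypoints")
    flat = []
    for x, y, v in keypoints_xyv:
        flat.extend([x, y, v])  # COCO stores keypoints as one flat list
    x, y, w, h = bbox_xywh
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": 1,            # "person" in COCO
        "bbox": [x, y, w, h],        # top-left corner plus width and height
        "iscrowd": 0,
        "keypoints": flat,
        "num_keypoints": sum(1 for _, _, v in keypoints_xyv if v > 0),
    }


# Example: 15 visible keypoints and 2 unlabeled ones (made-up coordinates).
example_keypoints = [(10.0, 20.0, 2)] * 15 + [(0.0, 0.0, 0)] * 2
example_annotation = make_person_annotation(
    1, 7, (5.0, 5.0, 50.0, 100.0), example_keypoints
)
```

A synthetic generator can fill the visibility flags exactly, because the renderer knows which joints are occluded. That is one reason such labels are cheap to produce in simulation.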

This tool uses a variety of domain randomization techniques to improve the robustness of models trained on synthetic data and their transfer to real-world tasks, a process known as sim2real transfer. Randomization is applied across parameters such as lighting, camera angles, object poses, and textures, which increases the generalization potential of models trained on the generated datasets.
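The idea of domain randomization can be sketched in a few lines of Python: each rendered frame draws its scene parameters independently from fixed distributions. This is only an illustration of the technique. The parameter names, ranges, and the `SceneParams` and `sample_scene` helpers are assumptions made for this example and do not reflect the actual PeopleSansPeople or Unity Perception randomizers.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """Randomized settings for one synthetic frame (hypothetical fields)."""
    light_intensity: float
    light_color_rgb: tuple
    camera_fov_deg: float
    camera_yaw_deg: float
    pose_index: int
    texture_index: int


def sample_scene(rng, num_poses=100, num_textures=50):
    """Draw one scene configuration; all ranges are illustrative only."""
    return SceneParams(
        light_intensity=rng.uniform(0.5, 3.0),
        light_color_rgb=tuple(rng.uniform(0.6, 1.0) for _ in range(3)),
        camera_fov_deg=rng.uniform(30.0, 90.0),
        camera_yaw_deg=rng.uniform(-180.0, 180.0),
        pose_index=rng.randrange(num_poses),        # which pose to apply
        texture_index=rng.randrange(num_textures),  # which texture to apply
    )


def sample_dataset(num_frames, seed=0):
    """Sample parameters for a whole dataset, reproducibly via the seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(num_frames)]


frames = sample_dataset(5, seed=42)
for params in frames:
    print(params)
```

Seeding the generator makes a dataset reproducible, while widening or narrowing the ranges controls how much variation the trained model is exposed to.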

Numerical Results

Empirical validation of PeopleSansPeople shows improved model performance on both bounding box and keypoint detection tasks. Specifically, pre-training a Detectron2 Keypoint R-CNN variant on synthetic data and then fine-tuning on real-world data outperformed training on the same real data alone. With limited real data (few-shot settings), keypoint AP increased by +38.03 (44.43 ± 0.17 vs. 6.40), while with abundant real data the increase was +1.47 (63.47 ± 0.19 vs. 62.00). Compared against models pre-trained on ImageNet, the PeopleSansPeople-derived models also performed better, by +22.53 keypoint AP (44.43 ± 0.17 vs. 21.90) in the few-shot setting and by +1.07 (63.47 ± 0.19 vs. 62.40) with abundant real data.
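The reported gains are simple differences between the mean keypoint AP of the synthetic-pre-trained model and that of each baseline. The short check below recomputes them from the figures given in the abstract; the dictionary keys and the `ap_gain` helper are naming choices made for this sketch.

```python
# Mean keypoint AP values from the abstract:
# (synthetic pre-training + fine-tuning, baseline).
REPORTED_AP = {
    "few-shot vs. real data only": (44.43, 6.40),
    "abundant vs. real data only": (63.47, 62.00),
    "few-shot vs. ImageNet pre-training": (44.43, 21.90),
    "abundant vs. ImageNet pre-training": (63.47, 62.40),
}


def ap_gain(with_synthetic, baseline):
    """Absolute AP gain, rounded to the two decimals used in the paper."""
    return round(with_synthetic - baseline, 2)


gains = {name: ap_gain(*pair) for name, pair in REPORTED_AP.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The comparison makes the pattern easy to see: the benefit of synthetic pre-training is largest when real data is scarce and shrinks to about one AP point when real data is abundant.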

Implications and Future Work

PeopleSansPeople gives computer vision researchers a way to generate human-centric synthetic data with controllable variation in human assets, lighting, and camera settings, without collecting more data from real people. This supports better scalability and generalization in human-centric computer vision models. Because the generator is freely available and built on a widely used platform, Unity, it can facilitate broader adoption and further work in the field.

The authors note that while the current results are promising, further exploration of hyperparameter tuning, domain adaptation strategies, and other synthetic data generation techniques could yield greater performance improvements. Extending the use of PeopleSansPeople beyond task benchmarking to applications such as augmented reality, surveillance, and human-computer interaction is a further direction for future research.

In conclusion, the PeopleSansPeople synthetic data generator is a significant addition to the tools available for research and development in human-centric computer vision. With its comprehensive approach to data synthesis, the paper sets a foundation for further advancements in understanding and bridging the sim2real gap, ultimately contributing to more robust and flexible computer vision models.

Authors (9)

Salehe Erfanian Ebadi
You-Cyuan Jhang
Alex Zook
Saurav Dhakad
Adam Crespi
Pete Parisi
Steven Borkman
Jonathan Hogins
Sujoy Ganguly