Learning from Simulated and Unsupervised Images through Adversarial Training (1612.07828v2)

Published 22 Dec 2016 in cs.CV, cs.LG, and cs.NE

Abstract: With recent progress in graphics, it has become more tractable to train models on synthetic images, potentially avoiding the need for expensive annotations. However, learning from synthetic images may not achieve the desired performance due to a gap between synthetic and real image distributions. To reduce this gap, we propose Simulated+Unsupervised (S+U) learning, where the task is to learn a model to improve the realism of a simulator's output using unlabeled real data, while preserving the annotation information from the simulator. We develop a method for S+U learning that uses an adversarial network similar to Generative Adversarial Networks (GANs), but with synthetic images as inputs instead of random vectors. We make several key modifications to the standard GAN algorithm to preserve annotations, avoid artifacts, and stabilize training: (i) a 'self-regularization' term, (ii) a local adversarial loss, and (iii) updating the discriminator using a history of refined images. We show that this enables generation of highly realistic images, which we demonstrate both qualitatively and with a user study. We quantitatively evaluate the generated images by training models for gaze estimation and hand pose estimation. We show a significant improvement over using synthetic images, and achieve state-of-the-art results on the MPIIGaze dataset without any labeled real data.

Learning from Simulated and Unsupervised Images through Adversarial Training

The paper "Learning from Simulated and Unsupervised Images through Adversarial Training" presents a method, termed Simulated+Unsupervised (S+U) learning, for refining synthetic images with unlabeled real images while preserving the annotation information from the synthetic data. The key advance is SimGAN, a system that improves the realism of synthetic images using an adversarial network, akin to Generative Adversarial Networks (GANs), to bridge the gap between the synthetic and real image distributions.

Overview of the SimGAN Framework

The SimGAN framework refines synthetic images generated by a simulator through a neural network called the refiner network. This network is trained with a combination of an adversarial loss and a self-regularization loss; a code sketch of the combined objective follows the list below.

  • Adversarial Loss: A discriminator network is trained to distinguish between real and refined images. The refiner network is optimized to fool the discriminator into classifying refined images as real, thereby improving realism.
  • Self-Regularization Loss: To ensure that the refinements do not alter the annotation information critical for training predictive models, the self-regularization term minimizes the per-pixel difference between the synthetic and refined images.
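
A minimal PyTorch-style sketch of this combined refiner objective is below. It is illustrative only: the function name, the λ weight of 0.1, and the assumption that the discriminator emits a two-class logit per local patch (see "Local Adversarial Loss" under Key Innovations) are ours, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def refiner_loss(disc_logits_refined, refined, synthetic, lam=0.1):
    """Combined refiner loss: adversarial term + self-regularization.

    disc_logits_refined: discriminator logits on the refined images,
        shape (B, 2, H', W') -- one {refined, real} logit pair per
        local patch (the local adversarial loss described below).
    lam: weight of the self-regularization term (value assumed here).
    """
    # Adversarial term: push every local patch of the refined image
    # toward the "real" class (index 1).
    log_probs = F.log_softmax(disc_logits_refined, dim=1)
    adv_loss = -log_probs[:, 1].mean()

    # Self-regularization: per-pixel L1 distance between refined and
    # synthetic images, so annotations (e.g. gaze direction) survive.
    reg_loss = torch.abs(refined - synthetic).mean()

    return adv_loss + lam * reg_loss
```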

The discriminator and refiner networks are updated alternately following a minimax game approach, typical of GANs.
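
A minimal sketch of that alternating loop, building on the loss above, might look as follows. The objects `R`, `D`, the optimizers, and the data loaders are hypothetical placeholders; `D` is assumed to return per-patch logits as in the previous sketch.

```python
import torch
import torch.nn.functional as F

def patch_targets(logits, value):
    # Per-patch class-index targets matching the discriminator output.
    b, _, h, w = logits.shape
    return torch.full((b, h, w), value, dtype=torch.long, device=logits.device)

for synthetic, real in zip(synthetic_loader, real_loader):
    # Refiner step: update R so that D classifies refined images as real.
    refined = R(synthetic)
    loss_r = refiner_loss(D(refined), refined, synthetic)
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

    # Discriminator step: separate real (label 1) from refined (label 0);
    # detach so gradients do not flow back into R.
    refined = R(synthetic).detach()
    logits_real, logits_fake = D(real), D(refined)
    loss_d = (F.cross_entropy(logits_real, patch_targets(logits_real, 1))
              + F.cross_entropy(logits_fake, patch_targets(logits_fake, 0)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```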

Key Innovations

  1. Self-Regularization: SimGAN introduces a self-regularization term to ensure that the important annotation information from the synthetic images remains unchanged.
  2. Local Adversarial Loss: The discriminator operates on local patches rather than the whole image, preventing the refiner network from introducing unrealistic artifacts and ensuring locally consistent realism.
  3. History of Refined Images: Training stability is further improved by updating the discriminator with a buffer of previously refined images, so that the discriminator does not forget earlier refiner outputs and the refiner cannot re-introduce artifacts the discriminator once rejected (a sketch of such a buffer follows this list).
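
A minimal sketch of such an image history buffer is below. The half-and-half mixing follows the paper's description; the capacity value and all names are our own assumptions.

```python
import random

class ImageHistoryBuffer:
    """Buffer of previously refined images for discriminator updates.

    Half of each discriminator mini-batch is drawn from this history
    and half from the current refiner output; old entries are randomly
    overwritten by new ones. The capacity value is assumed.
    """

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.images = []

    def mix(self, refined_batch):
        half = len(refined_batch) // 2
        # Draw the historical half before inserting the new images.
        history = (random.sample(self.images, half)
                   if len(self.images) >= half else [])
        # Insert new refined images, overwriting random old entries
        # once the buffer is full.
        for img in refined_batch:
            if len(self.images) < self.capacity:
                self.images.append(img)
            else:
                self.images[random.randrange(self.capacity)] = img
        if not history:
            return list(refined_batch)
        # Half current refiner output, half history.
        return list(refined_batch[:half]) + history
```

In the discriminator step of the loop sketched earlier, `refined` would be converted to a list of image tensors, passed through `buffer.mix(...)`, and the mixed batch re-stacked before being fed to `D`.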

Experimental Validation

The efficacy of SimGAN is demonstrated through qualitative and quantitative evaluations on two tasks: appearance-based gaze estimation and hand pose estimation from depth images.

Appearance-Based Gaze Estimation

Synthetic images for training were derived from the UnityEyes simulator, while real images were obtained from the MPIIGaze dataset. The refined synthetic images exhibited significant qualitative improvements, closely matching the texture and characteristics of real images. Quantitatively, a gaze estimation model trained on the refined images achieved a considerable reduction in error compared to one trained on the original synthetic images, demonstrating the effectiveness of the refinement process.

Hand Pose Estimation from Depth Images

The method was tested on the NYU hand pose dataset. The refined depth images successfully captured the noise characteristics and depth discontinuities observed in the real depth images. The hand pose estimation network, when trained on these refined images, outperformed models trained on both original synthetic and real images.

Numerical Results and Contributions

SimGAN achieved state-of-the-art results on both the MPIIGaze and NYU hand pose datasets without using any labeled real data:

  • For gaze estimation, training on the refined synthetic images reduced the mean error on the MPIIGaze dataset from 11.2 to 7.8 degrees.
  • For hand pose estimation, the refined images yielded an 8.8% improvement in the proportion of estimates within 5 pixels of the ground truth, compared to models trained on real images.

Implications and Future Directions

The implications of this work are threefold:

  • Practical: The ability to refine synthetic images using unsupervised real data can significantly reduce the dependency on extensive labeled datasets, thus expediting training processes in fields where obtaining labeled data is challenging or expensive.
  • Theoretical: The introduction of local adversarial loss and the buffer of refined images opens new avenues in stabilizing the training of generative models and addressing overfitting to synthetic details.
  • Speculative Future Developments: Future directions include extending SimGAN to handle other types of image data such as videos, further improving the refinement quality by modeling the noise distribution, and exploring applications in different domains such as medical imaging, autonomous driving, and robotics.

In conclusion, the SimGAN framework represents a notable advancement in leveraging adversarial training to enhance synthetic data realism, which holds significant potential across various domains of AI and machine learning.

Authors (6)
  1. Ashish Shrivastava (22 papers)
  2. Tomas Pfister (89 papers)
  3. Oncel Tuzel (62 papers)
  4. Josh Susskind (37 papers)
  5. Wenda Wang (2 papers)
  6. Russ Webb (16 papers)
Citations (1,769)