The Unusual Effectiveness of Averaging in GAN Training (1806.04498v2)

Published 12 Jun 2018 in stat.ML, cs.CV, and cs.LG

Abstract: We examine two different techniques for parameter averaging in GAN training. Moving Average (MA) computes the time-average of parameters, whereas Exponential Moving Average (EMA) computes an exponentially discounted sum. Whilst MA is known to lead to convergence in bilinear settings, we provide the -- to our knowledge -- first theoretical arguments in support of EMA. We show that EMA converges to limit cycles around the equilibrium with vanishing amplitude as the discount parameter approaches one for simple bilinear games and also enhances the stability of general GAN training. We establish experimentally that both techniques are strikingly effective in the non-convex-concave GAN setting as well. Both improve inception and FID scores on different architectures and for different GAN objectives. We provide comprehensive experimental results across a range of datasets -- mixture of Gaussians, CIFAR-10, STL-10, CelebA and ImageNet -- to demonstrate its effectiveness. We achieve state-of-the-art results on CIFAR-10 and produce clean CelebA face images. The code is available at https://github.com/yasinyazici/EMA_GAN

Citations (168)

Summary

The Unusual Effectiveness of Averaging in GAN Training

This paper presents a theoretical and experimental analysis of two techniques for parameter averaging in Generative Adversarial Network (GAN) training: Moving Average (MA) and Exponential Moving Average (EMA). The main objective is to investigate the effectiveness of these averaging methods in improving GAN training stability and performance.

GANs are formulated as two-player zero-sum games that often suffer from training instability and non-convergence. Traditional remedies either seek more stable function families or substitute alternative objectives, yet non-convergence persists because the optimization dynamics tend to cycle around optimal solutions. The paper addresses such cyclic behavior by averaging parameters outside of the adversarial training loop: the averaged copy is never fed back into the optimization, so the game dynamics themselves are unchanged.
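To make the procedure concrete, here is a minimal sketch in PyTorch (an illustration only, not the authors' released implementation; the names generator/avg_gen and the decay value 0.999 are assumptions). A frozen copy of the generator holds the averaged weights, and one of the two update rules is applied after each training step:

import copy

import torch


def init_averaged_copy(generator):
    # Frozen copy of the generator; its weights hold the running average
    # and are used only for evaluation, never for the adversarial updates.
    avg = copy.deepcopy(generator)
    for p in avg.parameters():
        p.requires_grad_(False)
    return avg


@torch.no_grad()
def update_ema(avg_gen, gen, beta=0.999):
    # EMA: theta_avg <- beta * theta_avg + (1 - beta) * theta
    for p_avg, p in zip(avg_gen.parameters(), gen.parameters()):
        p_avg.mul_(beta).add_(p, alpha=1.0 - beta)


@torch.no_grad()
def update_ma(avg_gen, gen, t):
    # MA: uniform average of the first t iterates, computed incrementally as
    # theta_avg <- theta_avg + (theta - theta_avg) / t, with t = 1, 2, ...
    for p_avg, p in zip(avg_gen.parameters(), gen.parameters()):
        p_avg.add_(p - p_avg, alpha=1.0 / t)

Because only this detached copy is averaged, the generator and discriminator continue to be trained with the usual alternating updates, which is why the technique leaves the game dynamics untouched.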

Theoretical Contributions

The authors provide the first theoretical insights into the EMA technique, analyzing its effect on bilinear games. In simple bilinear settings, EMA does not converge exactly to the equilibrium; instead, it traces limit cycles around the equilibrium whose amplitude vanishes as the discount parameter approaches one. The paper further shows that in non-bilinear settings, EMA preserves the stability of locally stable fixed points. This perspective helps explain why EMA improves training stability even when convergence to a strict equilibrium is not theoretically guaranteed.
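Written out explicitly (with \(\theta_t\) denoting the generator parameters after iteration \(t\) and \(\beta < 1\) the EMA discount factor; the paper's exact starting iteration and normalization may differ slightly), the two schemes are

\[
\theta^{\mathrm{MA}}_t = \frac{1}{t}\sum_{i=1}^{t}\theta_i,
\qquad
\theta^{\mathrm{EMA}}_t = \beta\,\theta^{\mathrm{EMA}}_{t-1} + (1-\beta)\,\theta_t .
\]

Unrolling the EMA recursion gives the exponentially discounted sum \(\theta^{\mathrm{EMA}}_t = (1-\beta)\sum_{i \le t} \beta^{t-i}\theta_i\) (up to the initialization term), so as \(\beta \to 1\) the average reaches further back in time, which is the regime in which the oscillations around the equilibrium shrink.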

Experimental Findings

The paper empirically evaluates both MA and EMA across a diverse set of GAN architectures and objectives on several datasets, including a mixture of Gaussians, CIFAR-10, STL-10, CelebA, and ImageNet. The experiments show that both MA and EMA improve standard GAN metrics such as the Inception Score and the Fréchet Inception Distance (FID). EMA consistently yields larger and more reliable gains than MA, whose uniform weighting of all past iterates can degrade results when those iterates vary substantially over long training runs.
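The contrast between the two schemes can be seen directly from the weight each one assigns to a past iterate (a small self-contained illustration; the iteration count and β = 0.999 are arbitrary choices, not values from the paper):

import numpy as np

t, beta = 10_000, 0.999
i = np.arange(1, t + 1)
w_ma = np.full(t, 1.0 / t)            # MA: every iterate counts equally
w_ema = (1 - beta) * beta ** (t - i)  # EMA: weight decays geometrically with age
print(w_ma[0], w_ema[0])              # earliest iterate: 1e-4 vs roughly 4.5e-8
print(w_ma[-1], w_ema[-1])            # latest iterate:   1e-4 vs 1e-3

Under MA, a stale iterate from the beginning of training carries the same weight as the most recent one, whereas EMA forgets old iterates geometrically, which matches the observation that MA degrades when the iterates drift over long training runs.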

The paper also compares these averaging methods against other stabilization techniques such as Consensus Optimization, Optimistic Adam, and Zero-centered Gradient Penalty, finding that averaging is unusually effective at alleviating the cycling and non-convergence issues commonly seen in GAN training. In particular, EMA improves results considerably across diverse experimental settings, further confirming its unusual efficacy.

Implications and Future Directions

While the theoretical framework is largely limited to bilinear models and local stability analysis, the implications are valuable in practical contexts where GAN applications demand robust training strategies for better visual quality and stability. The findings suggest that parameter averaging, particularly EMA, can serve as a simple yet powerful addition to existing optimization protocols in GAN training.

The paper's results warrant further exploration into different configurations of the EMA discount factor to optimize its performance across various GAN models and tasks. This could involve more comprehensive hyperparameter studies to fine-tune EMA's application in conditional GAN setups or larger-scale generative contexts.

In sum, this research makes a significant contribution to understanding and improving GAN training methodologies, offering a strong case for including parameter averaging strategies in the standard GAN training toolkit. It also paves the way for future analyses of more complex games beyond the bilinear setting, deepening our understanding of GAN dynamics and advancing generative modeling.