Generative Adversarial Networks (GANs) are powerful models for learning complex probability distributions, but they are known to be challenging to train. A common failure mode is mode collapse, where the generator focuses on producing samples from only a subset of the modes in the target distribution, failing to capture its full diversity. This paper (Durr et al., 2022) investigates mode collapse from a dynamical systems perspective by introducing a simplified, interpretable model that replaces the generator network with a collection of particles in data-space.
The standard GAN objective involves a generator $G_\theta$ and a discriminator $D_\omega$, trained adversarially. The discriminator tries to distinguish real data samples ($x \sim p_{\mathrm{data}}$) from generated samples ($\tilde{x} = G_\theta(z)$, where $z \sim p_z$), while the generator tries to fool the discriminator. The objective often takes the form:
$$ V(\theta, \omega) \;=\; \mathbb{E}_{z \sim p_z}\!\left[D_\omega(G_\theta(z))\right] \;-\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[D_\omega(x)\right] \;-\; R(D_\omega), $$
where $R(D_\omega)$ is a regularizer on the discriminator. Training typically involves alternating gradient ascent on $V$ with respect to the discriminator parameters $\omega$ and gradient descent with respect to the generator parameters $\theta$. Mode collapse manifests as the generator's distribution oscillating between different modes of the data.
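Written out with learning rates $\eta_D$ and $\eta_G$ (symbols introduced here for concreteness, not taken from the paper), the alternating updates read:
$$ \omega \;\leftarrow\; \omega + \eta_D\,\nabla_\omega V(\theta,\omega), \qquad \theta \;\leftarrow\; \theta - \eta_G\,\nabla_\theta V(\theta,\omega). $$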
Instead of tracking generator parameters $\theta$, this paper studies the dynamics of the generator's outputs, $x_i = G_\theta(z_i)$ for $i = 1, \dots, N$, treated as particles in data-space. The evolution of these particles is related to the generator parameter updates through the Jacobian of the generator and the Neural Tangent Kernel (NTK), $\Theta^{\alpha\beta}_{ij} = \nabla_\theta G^\alpha_\theta(z_i) \cdot \nabla_\theta G^\beta_\theta(z_j)$, where $\alpha, \beta$ index data-space dimensions. For infinite-width neural networks, the NTK becomes static, fixed at initialization.
The paper proposes a simplified NTK structure for a system of $N$ particles in data-space, abstracting away the latent-space inputs $z_i$. This coarse-grained NTK is defined as:
$$ \Theta^{\alpha\beta}_{ij} \;=\; \delta^{\alpha\beta}\left[\,a\,\delta_{ij} \;+\; b\,(1-\delta_{ij})\,\right]. $$
Here, $\delta^{\alpha\beta}$ implies no correlation between gradients across different output dimensions, and the NTK value depends only on whether the particles are the same ($i = j$, value $a$) or distinct ($i \neq j$, value $b$). This simplification is motivated by NTK properties of wide networks with certain activations (like ReLU) when latent inputs are sampled from a high-dimensional sphere.
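As a concrete illustration, this coarse-grained NTK can be assembled as an explicit tensor. The sketch below assumes NumPy; `N`, `d`, `a_diag`, and `b_off` are illustrative names and values, not the paper's:

```python
import numpy as np

N, d = 200, 2              # number of particles, data-space dimension
a_diag, b_off = 1.0, 0.2   # same-particle (a) and cross-particle (b) NTK values

# Particle-index part: a on the diagonal, b everywhere else.
K = b_off * np.ones((N, N)) + (a_diag - b_off) * np.eye(N)
# Output-dimension part: identity, i.e. no correlation across output dimensions.
Theta = np.einsum("ij,ab->iajb", K, np.eye(d))   # shape (N, d, N, d)
```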
Under this simplified NTK, the dynamics of particle $i$ follows from the generator's objective (minimizing the discriminator's expected output over the generated particles, $L_G = \frac{1}{N}\sum_j D(x_j)$):
$$ \frac{dx_i^\alpha}{dt} \;=\; -\sum_{j,\beta} \Theta^{\alpha\beta}_{ij}\, \frac{\partial L_G}{\partial x_j^\beta}. $$
Substituting the simplified NTK, this becomes:
$$ \frac{dx_i}{dt} \;=\; -\frac{a-b}{N}\,\nabla D(x_i) \;-\; b\,\overline{\nabla D}, \qquad \overline{\nabla D} \;=\; \frac{1}{N}\sum_j \nabla D(x_j), $$
where $\overline{\nabla D}$ is the average discriminator gradient over all particles. This equation highlights a key dynamic: each particle's velocity combines its local discriminator gradient with the average gradient across the entire particle ensemble, weighted by the NTK values $a$ and $b$.
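A minimal sketch of this update rule (assuming NumPy; the names `particle_velocity`, `grad_D`, `a_diag`, and `b_off` are illustrative, and the quadratic discriminator in the usage example is purely for demonstration):

```python
import numpy as np

def particle_velocity(x, grad_D, a_diag=1.0, b_off=0.2):
    """x: (N, d) particle positions; grad_D: callable mapping (N, d) -> (N, d)."""
    g = grad_D(x)                            # local discriminator gradient at each particle
    g_avg = g.mean(axis=0, keepdims=True)    # average gradient over the whole ensemble
    N = x.shape[0]
    # Local term plus collective (average) term, as in the equation above.
    return -((a_diag - b_off) / N) * g - b_off * g_avg

# Usage with a toy quadratic discriminator D(x) = |x|^2 / 2, whose gradient is x.
x = np.random.randn(200, 2)
v = particle_velocity(x, grad_D=lambda pts: pts)
```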
To study mode collapse, the paper applies this model to a 2D toy problem: generating samples from a mixture of 8 Gaussians arranged in a circle, a common benchmark for mode collapse. The "generator" is a set of 200 particles, initialized as a Gaussian cluster. The "discriminator" is a single-hidden-layer neural network. Training alternates between updating the discriminator parameters via gradient ascent on the objective and updating the particle positions via the derived gradient dynamics (Algorithm \ref{alg:gan_training_2}).
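The following sketch shows one way such an alternating loop could be implemented in PyTorch. It is a minimal illustration, not the paper's code: the hidden width, learning rates, step counts, mode radius, and noise scales are assumed values.

```python
import math
import torch

torch.manual_seed(0)
N, d, n_modes = 200, 2, 8
a_diag, b_off = 1.0, 0.2           # simplified NTK values (illustrative)
eta_G, eta_D, n_D = 0.05, 0.05, 5  # particle step size, discriminator step size, D steps per G step

# Eight Gaussian modes arranged on a circle.
angles = torch.arange(n_modes, dtype=torch.float32) * 2 * math.pi / n_modes
modes = 4.0 * torch.stack([angles.cos(), angles.sin()], dim=1)   # (8, 2)

def sample_real(batch):
    idx = torch.randint(n_modes, (batch,))
    return modes[idx] + 0.1 * torch.randn(batch, d)

# Single-hidden-layer ReLU discriminator.
D = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1))
opt_D = torch.optim.SGD(D.parameters(), lr=eta_D)

x = 0.5 * torch.randn(N, d)   # the "generator": particles initialized as a Gaussian cluster

for step in range(2000):
    # Discriminator: n_D gradient-ascent steps on E_fake[D] - E_real[D]
    # (implemented as descent on the negated objective).
    for _ in range(n_D):
        loss_D = D(sample_real(N)).mean() - D(x).mean()
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()
    # Particles: one step of the simplified-NTK dynamics derived above.
    xg = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(D(xg).sum(), xg)[0]   # grad of D at each particle, shape (N, d)
    x = x - eta_G * (((a_diag - b_off) / N) * grad + b_off * grad.mean(0, keepdim=True))
```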
The paper quantifies mode collapse using a metric based on the entropy of the distribution of particles assigned to their nearest mode (Equation \ref{eq:mode_collapse_metric}). A low metric value (near 0) indicates particles are distributed across all 8 modes (convergence), while a high value (near $\ln 8$) indicates collapse to a single mode.
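One possible implementation of such a metric, assuming it is the entropy deficit $\ln M - H$ of the nearest-mode occupancy histogram (the exact normalization in Equation \ref{eq:mode_collapse_metric} may differ):

```python
import numpy as np

def mode_collapse_metric(x, modes):
    """x: (N, d) particle positions; modes: (M, d) mode centers."""
    dists = np.linalg.norm(x[:, None, :] - modes[None, :, :], axis=-1)  # (N, M)
    nearest = dists.argmin(axis=1)                                      # nearest mode per particle
    counts = np.bincount(nearest, minlength=len(modes))
    p = counts / counts.sum()
    H = -np.sum(p[p > 0] * np.log(p[p > 0]))     # entropy of the mode occupancy
    return np.log(len(modes)) - H                # 0: all modes covered; ln M: full collapse
```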
Experiments varying the NTK ratio $b/a$ and the discriminator's relative training time ($n_D$, the number of discriminator steps per generator step) reveal a transition boundary between convergence and mode collapse.
- When $b = 0$ (diagonal NTK), particles evolve independently based on their local gradients, and the system converges to cover all modes (Figure \ref{fig:no_ntk_2d}).
- When $b/a$ is sufficiently large (e.g., $b/a = 1/5$), the average-gradient term dominates, causing the entire particle cluster to move together, chasing discriminator minima from mode to mode – the signature of mode collapse (Figure \ref{fig:with_ntk_2d}).
The paper relates the shape of this boundary to the discriminator's ability to learn "high-frequency" spatial features. To break apart a particle cluster, the discriminator needs to create a minimum near the cluster's center, which requires learning finer spatial details (higher frequencies) on a spatial scale set by the cluster's size. The time (or $n_D$) needed for the discriminator to learn such features depends on its frequency-dependent learning rate, $\epsilon(k)$. For ReLU discriminators, $\epsilon(k)$ is known to decay as a power law, leading to a power-law boundary in the $b/a$ vs. $n_D$ plane (Figure \ref{fig:relu_data}). For Tanh discriminators, $\epsilon(k)$ decays exponentially, resulting in an exponential boundary (Figure \ref{fig:tanh_data}). This suggests that the Frequency Principle in neural networks plays a direct role in GAN convergence behavior.
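Schematically, restating this argument with unspecified constants rather than the paper's expressions: if splitting the cluster requires the discriminator to resolve a characteristic frequency $k^{*}$, the discriminator training effort needed at the boundary scales as
$$ n_D^{*} \;\sim\; \frac{1}{\epsilon(k^{*})}, \qquad \epsilon_{\mathrm{ReLU}}(k) \sim k^{-\gamma} \;\Rightarrow\; \text{power-law boundary}, \qquad \epsilon_{\mathrm{Tanh}}(k) \sim e^{-ck} \;\Rightarrow\; \text{exponential boundary}. $$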
The paper also shows how the model can incorporate regularizers. A kinetic-energy regularizer on the generator parameters, $R \propto \|\dot{\theta}\|^2$, translates to a term involving the NTK and the discriminator gradients (Equation \ref{eq:reg}). This term acts analogously to damping in a physical system. Adding this regularizer to the model GAN with an otherwise-collapsing value of $b/a$ successfully restores convergence (Figure \ref{fig:reg_dyn_scatters}). Varying the regularization strength reveals under-, critically-, and over-regularized regimes, mirroring damping dynamics (Figure \ref{fig:reg_dyn}), suggesting that finding the optimal regularization strength is crucial in practice.
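Under the simplified NTK, such a term reduces (up to prefactors) to $\sum_{ij} \nabla D(x_i)^{\top}\Theta_{ij}\nabla D(x_j) = (a-b)\sum_i \|\nabla D(x_i)\|^2 + b\,\|\sum_i \nabla D(x_i)\|^2$. A sketch of how this penalty could be computed and attached to the discriminator objective (assuming PyTorch, the illustrative `a_diag`/`b_off` values from earlier, and a hypothetical `kinetic_regularizer` helper; the prefactor and placement follow the description above, not the paper's code):

```python
import torch

def kinetic_regularizer(D, x, a_diag=1.0, b_off=0.2, strength=0.1):
    """Penalty proportional to the squared generator-parameter velocity."""
    xg = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(D(xg).sum(), xg, create_graph=True)[0]   # (N, d)
    N = x.shape[0]
    term = (a_diag - b_off) * (grad ** 2).sum() + b_off * (grad.sum(0) ** 2).sum()
    return strength * term / N ** 2
```

In the training loop sketched earlier, this penalty would be added to `loss_D` before the discriminator update, discouraging discriminator configurations that drive large collective particle velocities.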
Practical Implications and Implementation:
- Interpreting Dynamics: The particle model offers a simplified, visualizable way to understand the complex adversarial dynamics of GANs and why mode collapse occurs due to correlated particle movement.
- Diagnosing Mode Collapse: The competition between local and average gradients ($(a-b)\,\nabla D(x_i)$ vs. $b\,\overline{\nabla D}$) provides an intuitive explanation for mode collapse: when the global influence ($b$) is too strong relative to the local influence ($a-b$), or when the discriminator cannot sufficiently minimize the average gradient over a region, particles fail to split and cover multiple modes.
- Improving Training: The findings suggest practical strategies:
- Modifying Generator Architecture: Architectures yielding a smaller ratio $b/a$ in their infinite-width NTK might be inherently more stable against mode collapse. While directly calculating the NTK is hard, design choices could implicitly affect this ratio.
- Discriminator Training Speed: The relative learning rates of the generator and discriminator ($\eta_G$, $\eta_D$) and the number of discriminator steps per generator step ($n_D$) are critical. Sufficient discriminator training is needed to learn fine-grained features and break particle clusters.
- Discriminator Properties: Discriminators with faster learning rates for high-frequency functions could potentially mitigate mode collapse more effectively, especially when the particle correlation ($b/a$) is high.
- Regularization: Regularizers acting on generator gradients (explicitly or implicitly via the discriminator objective) can act like damping, stabilizing training and preventing oscillatory mode switching. Tuning the strength is important to avoid over-regularization.
- Implementing the Model: Simulating this model involves:
- Representing the generator distribution by points in data-space.
- Implementing the chosen discriminator network (e.g., a simple ReLU/Tanh MLP; see the sketch following this list).
- Implementing the training loop: sample real data, calculate the discriminator loss, and update the discriminator via gradient ascent; then calculate the discriminator gradients at all particles, compute the average gradient, and update the particle positions using the derived dynamics equation with the simplified NTK values $a$ and $b$. Regularization terms would be added to the discriminator's objective before its update step.
- Calculating the mode collapse metric by assigning particles to the nearest mode and computing the entropy of the mode distribution.
- Limitations: The model relies on significant simplifications (a static NTK with fixed values $a$ and $b$, and a fixed set of particles representing the distribution) that might not fully capture the complexities of real GAN training with dynamic NTKs, varying batch samples, and the full generator parameter space. However, its value lies in providing an interpretable lower-dimensional system for studying the fundamental dynamics.
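For the discriminator mentioned in the implementation notes above, a minimal constructor covering both the ReLU and Tanh variants compared in the experiments might look like this (the hidden width is an assumed value, and `make_discriminator` is a hypothetical helper, not from the paper):

```python
import torch

def make_discriminator(activation="relu", d_in=2, width=128):
    """Single-hidden-layer MLP discriminator with a choice of activation."""
    act = torch.nn.ReLU() if activation == "relu" else torch.nn.Tanh()
    return torch.nn.Sequential(torch.nn.Linear(d_in, width), act, torch.nn.Linear(width, 1))

# e.g. D = make_discriminator("tanh") to probe the exponentially slow high-frequency regime.
```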
In summary, the paper provides a valuable theoretical framework grounded in physics concepts (dynamical systems, NTK, damping) to understand mode collapse in GANs. The proposed particle model simplifies the problem while retaining key properties, allowing for concrete experiments demonstrating how internal dynamics governed by the NTK and the discriminator's properties (learning speed, frequency principle) drive the transition to mode collapse. This work provides principles that could guide the development of more stable GAN architectures and training algorithms.