Progressive Growing of GANs
- The technique progressively extends both generator and discriminator networks, allowing synthesis of fine-scale details with improved stability and realism.
- A smooth fade-in mechanism blends new and existing layers during resolution transitions, supported by equalized learning rates and pixelwise normalization.
- Adaptations across domains like medical imaging and video synthesis demonstrate P-GAN's versatility and effectiveness in handling high-dimensional data.
Progressive Growing of Generative Adversarial Networks (P-GAN) is a training methodology in which a GAN's generator and discriminator are both repeatedly extended by new layers, allowing the networks to synthesize increasingly fine-scale detail as resolution is gradually increased. This approach, popularized by Karras et al., has led to marked advances in high-resolution image synthesis and has been adapted into medical imaging, video synthesis, physical layout generation, and other domains. By decomposing the generative task into a sequence of sub-tasks that tackle ever-greater spatial (and sometimes temporal) granularity, P-GAN improves stability, realism, and variation while remaining tractable for modern hardware.
1. Foundational Principles and Progressive Growing Strategy
At the core of P-GAN is the progressive layering of both the generator (G) and the discriminator (D), which begin operating at a minimal spatial scale (typically 4×4). After initial convergence at this low resolution, a new block is appended to both G and D, doubling the working resolution (e.g., 8×8, 16×16, and so on) (Karras et al., 2017, Beers et al., 2018, Wen et al., 2019).
During each transition, the networks employ a linear fade-in mechanism controlled by a scalar $\alpha \in [0, 1]$, which blends activations from the old layers (upsampled path) and the new block (convolution path) according to

$$y = (1 - \alpha)\, y_{\text{old}} + \alpha\, y_{\text{new}}$$

with $\alpha$ increasing linearly from 0 to 1 over a prescribed number of images or iterations (Karras et al., 2017, Wen et al., 2019). Once $\alpha = 1$, training proceeds on the fully grown network for a "stabilization" phase before the next resolution increase.
This smooth transition is crucial to retain previously learned coarse features and to prevent destabilization as resolution increases.
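The fade-in mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration, not the papers' implementation: the names `fade_in` and `alpha_schedule` and the phase length in the example are illustrative, and real code blends full feature tensors rather than lists of scalars.

```python
def fade_in(old_path, new_path, alpha):
    """Linearly blend the upsampled old-layer output with the new block's
    output; alpha ramps from 0 to 1 over the transition phase."""
    return [(1.0 - alpha) * o + alpha * n for o, n in zip(old_path, new_path)]

def alpha_schedule(images_seen, fade_images):
    """Linear fade-in coefficient, clamped to [0, 1]."""
    return min(1.0, images_seen / fade_images)
```

At `alpha = 0` the network is unchanged; at `alpha = 1` only the new block contributes, at which point the stabilization phase begins.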
2. Architectural Components and Training Workflow
The generator structure at each stage consists of an upsampling operation, followed by pairs of convolutional layers, pixelwise feature normalization, and nonlinear activation (usually LeakyReLU). Correspondingly, the discriminator mirrors this structure with convolutions followed by average pooling (downsampling), a minibatch standard deviation layer for mode collapse mitigation, and final fully connected layers for scalar output (Karras et al., 2017, Beers et al., 2018, Eklund, 2019).
A weight scaling ("equalized learning rate") mechanism is employed per layer, standardizing the dynamic range across layers. The feature map scheduling typically follows:
| Resolution | Generator: Feature Maps | Discriminator: Feature Maps |
|---|---|---|
| 4x4 | 512 | 512 |
| 8x8 | 512 | 512 |
| 16x16 | 512 | 512 |
| 32x32 | 512 | 512 |
| 64x64 | 256 | 256 |
| 128x128 | 128 | 128 |
| 256x256 | 64 | 64 |
| 512x512 | 32 | 32 |
| 1024x1024 | 16 | 16 |
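The equalized learning rate described above can be sketched as follows: weights are drawn from a unit normal and rescaled at runtime by He's initialization constant, so every layer sees a similar effective gradient scale. The scalar dot-product "layer" below is an illustrative stand-in for a full convolutional weight tensor.

```python
import math

def he_constant(fan_in, gain=math.sqrt(2.0)):
    """Per-layer runtime scaling constant from He's initializer."""
    return gain / math.sqrt(fan_in)

def equalized_forward(weights, inputs):
    """Apply weights drawn from N(0, 1), scaled at *runtime* by the He
    constant, rather than baking the scale into the initialization."""
    c = he_constant(len(weights))
    return sum(c * w * x for w, x in zip(weights, inputs))
```

Because the scale is applied at runtime, adaptive optimizers such as Adam (which normalize per-parameter gradient magnitudes) update all layers at a comparable speed.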
Pixelwise feature normalization takes place after each activation in the generator:

$$b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N} \sum_{j=0}^{N-1} \left(a_{x,y}^{j}\right)^{2} + \epsilon}}$$

where $\epsilon = 10^{-8}$ and $N$ is the number of channels.
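Pixelwise normalization can be written directly from the formula: each pixel's feature vector is divided by the root-mean-square of its own channels. A minimal sketch (operating on a single pixel's channel list rather than a full feature map):

```python
import math

def pixel_norm(features, eps=1e-8):
    """Normalize one pixel's feature vector to unit average magnitude:
    b_j = a_j / sqrt(mean_j(a_j^2) + eps)."""
    scale = math.sqrt(sum(a * a for a in features) / len(features) + eps)
    return [a / scale for a in features]
```

After normalization the mean squared channel value is (up to epsilon) exactly 1, which prevents generator activation magnitudes from escalating during adversarial competition.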
For variants focused on computational efficiency, all standard convolutions may be replaced by depthwise-separable convolution blocks, resulting in markedly reduced multiply–accumulate counts and a substantial speed-up per training epoch at high resolutions (Karwande et al., 2022).
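The multiply–accumulate (MAC) savings follow from simple counting: a depthwise-separable block replaces one k×k×C_in×C_out convolution with a per-channel k×k depthwise pass plus a 1×1 pointwise pass. A sketch of the arithmetic (layer shapes in the example are illustrative):

```python
def standard_conv_macs(h, w, cin, cout, k=3):
    """Multiply-accumulates for a standard k x k convolution
    over an h x w feature map (stride 1, 'same' padding)."""
    return h * w * cin * cout * k * k

def depthwise_separable_macs(h, w, cin, cout, k=3):
    """Depthwise (per-channel k x k) followed by pointwise (1 x 1) conv."""
    depthwise = h * w * cin * k * k
    pointwise = h * w * cin * cout
    return depthwise + pointwise
```

The ratio is 1/C_out + 1/k², so with 3×3 kernels and a few hundred channels the separable block needs roughly a ninth of the MACs.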
3. Loss Functions, Training Schedules, and Stability Mechanisms
P-GAN almost universally employs the Wasserstein GAN loss with gradient penalty (WGAN-GP):

$$L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim \mathbb{P}_r}\left[D(x)\right] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$

where $\lambda$ is a penalty coefficient, $\tilde{x} = G(z)$ with $z$ drawn from the latent prior $p(z)$, and $\mathbb{P}_r$, $\mathbb{P}_g$ denote the real and generated distributions (Karras et al., 2017, Wen et al., 2019, Karwande et al., 2022). The optimization utilizes Adam, commonly with $\beta_1 = 0$ and $\beta_2 = 0.99$, and dynamically adjusted batch sizes to fit available GPU memory.
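The gradient-penalty term can be illustrated on a toy one-dimensional critic. This sketch substitutes a central finite difference for the automatic differentiation used in practice, and the default `lam=10.0` matches the coefficient commonly reported for WGAN-GP; both choices are illustrative.

```python
def gradient_penalty_1d(critic, x_hat, lam=10.0, h=1e-5):
    """WGAN-GP penalty lam * (|D'(x_hat)| - 1)^2 for a scalar critic,
    using a central finite difference in place of autograd."""
    grad = (critic(x_hat + h) - critic(x_hat - h)) / (2 * h)
    return lam * (abs(grad) - 1.0) ** 2
```

A critic with unit slope (a 1-Lipschitz function at that point) incurs no penalty; a slope of 2 incurs the full λ·(2−1)² cost, which is what pushes D toward the Lipschitz constraint.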
Stability-enabling tricks include pixelwise feature vector normalization in G and per-feature minibatch standard deviation aggregation in D, which mitigate unhealthy generator/discriminator competition and mode collapse (Karras et al., 2017, Beers et al., 2018).
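The minibatch standard deviation statistic can be sketched as follows: compute the per-feature standard deviation across the batch, then average into one scalar that the discriminator sees (in the full method, broadcast as an extra constant feature map). Here samples are flat lists of features rather than spatial maps, for brevity.

```python
import math

def minibatch_stddev(batch):
    """Average per-feature standard deviation across a minibatch;
    a batch with no variety (mode collapse) yields a value near zero."""
    n = len(batch)         # batch size
    m = len(batch[0])      # features per sample
    stds = []
    for j in range(m):
        col = [sample[j] for sample in batch]
        mean = sum(col) / n
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in col) / n))
    return sum(stds) / m
```

Because collapsed generator output makes this statistic vanish while real data keeps it positive, the discriminator can use it to penalize low-variety batches.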
In settings where progression is truncated at intermediate resolutions, outputs are post-processed by a super-resolution GAN (SRGAN), which is independently trained to upsample images. The composite SRGAN loss is

$$l^{SR} = l^{SR}_{\text{content}} + \eta\, l^{SR}_{\text{adv}}$$

with the perceptual content loss based on VGG features and a small adversarial weight $\eta$ (typically $10^{-3}$ in SRGAN) (Karwande et al., 2022).
4. Domain Adaptations and Extensions
The progressive growing methodology has been extended well beyond natural image synthesis. In medical imaging, segmentation maps are incorporated as extra input or output channels in G and D, enabling the networks to synthesize anatomical and pathological detail at native resolution (Beers et al., 2018, Liang et al., 2020). In video synthesis, 3D convolutional blocks are progressively grown in both spatial and temporal directions, with dedicated fade-in schedules for each dimension (Acharya et al., 2018, Aigner et al., 2018).
In physical layout optimization (metasurfaces), conditional inputs (e.g. wavelength, deflection angle) are embedded and concatenated to the noise vector before feeding to G, enabling parametric image generation. Progressive training-set refinement is used (see "UpdateTrainingSet" pseudocode in (Wen et al., 2019)), yielding significant computational cost reductions over conventional topology optimization.
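The conditioning step can be sketched as follows. This does not reproduce Wen et al.'s exact embedding; it only illustrates the general pattern of normalizing physical conditions and concatenating them onto the latent vector, with illustrative helper names.

```python
def normalize_condition(value, lo, hi):
    """Map a physical condition (e.g. a wavelength in nm) to [-1, 1]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def conditional_latent(noise, conditions):
    """Concatenate normalized conditioning values onto the latent noise
    vector before it is fed to the generator."""
    return list(noise) + [float(c) for c in conditions]
```

At inference time, sweeping the condition values while holding the noise fixed then produces a parametric family of layouts.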
For three-dimensional neuroimaging synthesis, all operations are replaced by 3D variants, channel counts are reduced to fit memory constraints, and data volumes are cropped and upsampled appropriately (Eklund, 2019).
5. Evaluation Metrics, Experimental Results, and Benchmarks
Progressive growing improves both image fidelity and diversity. Key evaluation metrics include Sliced Wasserstein Distance (SWD), Fréchet Inception Distance (FID), Multi-Scale Structural SIMilarity (MS-SSIM), and application-specific measures (AUC on vessel detection, Dice score on segmentations, deflection efficiency for metasurfaces):
| Method | SWD (patch) | FID | MS-SSIM | Inception Score (IS) |
|---|---|---|---|---|
| P-GAN (full model) | 2.96e-3 | 8.34 | 0.2828 | 8.80 (CIFAR-10) |
| Sketch-guided PGSGAN | -- | 54.94 | 0.4895 | -- |
| P-GAN + SRGAN (CelebA) | 381.86 | -- | 0.1698 | 2.138 ± 0.130 |
User studies and downstream segmentation tasks confirm that realism and utility are substantially increased relative to both non-progressive GANs and conventional adversarial architectures (Liang et al., 2020, Beers et al., 2018).
Training costs depend heavily on resolution and architecture. Depthwise-separable convolutions roughly halve the per-epoch time at higher resolutions; for high-resolution synthesis, staged progressive growth avoids days of training time at the largest scales (Karwande et al., 2022, Beers et al., 2018).
6. Practical Considerations and Implementation Guidelines
A range of practical issues must be addressed for efficient and stable P-GAN deployment:
- GPU memory consumption increases rapidly with resolution; reductions in batch size and freezing of early layers may be necessary (Beers et al., 2018, Eklund, 2019).
- Fade-in schedules should be sufficiently long (e.g., 20,000 batches or more per phase) to avoid destabilization (Beers et al., 2018).
- Equalized learning rate and pixelwise normalization are preferred over batchnorm (Karras et al., 2017).
- Poor segmentation or conditional labels will degrade output quality in conditional and segmentation-aware variants (Beers et al., 2018).
- Transfer learning is facilitated by retaining low-resolution weights and restarting progression at an appropriate stage for a new dataset (Beers et al., 2018).
- Task-specific output and conditioning strategies can be implemented provided the progressive schedule is respected (Wen et al., 2019).
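The scheduling guidelines above can be combined into a single phase plan: each resolution doubling gets a fade-in phase followed by a stabilization phase of sufficient length. A sketch, where the function name and the default phase lengths (600 k images each, in the spirit of the original schedules) are illustrative:

```python
def progressive_schedule(start_res=4, final_res=1024,
                         fade_images=600_000, stable_images=600_000):
    """Enumerate (resolution, phase, images) training phases: an initial
    stabilization at the lowest resolution, then a fade-in plus a
    stabilization phase for every doubling up to final_res."""
    phases = [(start_res, "stabilize", stable_images)]
    res = start_res
    while res < final_res:
        res *= 2
        phases.append((res, "fade", fade_images))
        phases.append((res, "stabilize", stable_images))
    return phases
```

Driving the training loop from such a table makes it straightforward to restart progression at an intermediate stage for transfer learning, as noted above.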
7. Extensions, Limitations, and Future Directions
Current adaptations of progressive growing cover: image synthesis up to 1024×1024 resolution (Karras et al., 2017), medical image domains with auxiliary segmentation (Beers et al., 2018), ultrasound and other modalities with sketch guidance (Liang et al., 2020), 3D neuroimaging synthesis of full brain volumes (Eklund, 2019), and video sequences with spatial and temporal resolution rising jointly (Acharya et al., 2018, Aigner et al., 2018).
Reported limitations include hardware-imposed resolution ceilings (particularly severe for 3D volumes), a lack of quantitative metrics in certain domains (neuroimaging, video), and eventual saturation of image realism at extreme scales. Future work encompasses extending P-GAN to conditional domains, enhancing training-set refinement, and pushing resolution and temporal scales further with distributed computation and improved memory management (Eklund, 2019, Acharya et al., 2018, Wen et al., 2019).
Through a curriculum-based regime of incremental resolution growth, progressive GANs have provided a stable and high-fidelity foundation for generative modeling in a wide array of high-dimensional domains.