Progressive Structure-Conditional GANs

Updated 10 February 2026
  • The paper introduces a novel GAN architecture that integrates multi-scale pose information at every stage to ensure consistent full-body image synthesis.
  • It employs a progressive growing paradigm with a fade-in mechanism, concatenating downsampled pose maps to guide both generator and discriminator outputs.
  • Empirical evaluation shows PSGAN delivers sharper detail and improved pose fidelity compared to unconditional and one-shot pose transfer methods.

Progressive Structure-Conditional Generative Adversarial Networks (PSGANs) are a specialized generative architecture developed to synthesize high-resolution, full-body, structurally consistent character images by integrating multi-scale structural (pose) information within a progressive training scheme. PSGAN advances the state-of-the-art in conditional generative modeling by enforcing spatial alignment between synthesized image content and supplied pose maps at all resolutions, targeting the joint challenges of photo-realistic detail and pose fidelity in single-image and video generation settings, as demonstrated in full-resolution anime character synthesis (Hamada et al., 2018).

1. Architectural Framework

PSGANs are architecturally grounded in the progressive growing paradigm originally introduced by Karras et al. (2018), but extend this methodology by incorporating explicit structure-conditional signals at every stage of both the generator and discriminator. At each spatial resolution n×n, a downsampled pose map S_n, constructed from high-resolution keypoint maps, is concatenated along the channel axis to the network's feature maps.

Generator

  • Input: latent vector z ~ N(0, I), typically 512-dimensional.
  • Initial layer: a learned 4×4 "constant" block.
  • Upsampling: each stage applies nearest-neighbor upsampling followed by two conv–Leaky-ReLU blocks (3×3 conv → Leaky ReLU → 3×3 conv → Leaky ReLU).
  • Structure Conditioning: after every conv block at resolution n, the M-channel pose map S_n (derived from the full-resolution S_1024 via repeated max pooling) is concatenated to the feature tensor.
  • Fade-in: transitions between resolutions use linear blending between coarse and fine paths to ensure training stability.
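The per-stage conditioning described above can be sketched in NumPy. This is a shapes-only illustration: convolutions, weights, and the fade-in path are omitted, and the function names are illustrative, not taken from the paper's code.

```python
import numpy as np

def upsample_nearest(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature tensor."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def generator_stage(features, pose_map):
    """One PSGAN generator stage, shapes only: upsample the features,
    then concatenate the matching-resolution pose map along channels.
    The 3x3 conv -> LeakyReLU pairs are elided in this sketch."""
    up = upsample_nearest(features)                 # (C, 2H, 2W)
    return np.concatenate([up, pose_map], axis=0)   # (C + M, 2H, 2W)

feat = np.zeros((256, 8, 8))    # features at 8x8
S_16 = np.zeros((20, 16, 16))   # M = 20 pose channels at 16x16
out = generator_stage(feat, S_16)
print(out.shape)  # (276, 16, 16)
```

The discriminator mirrors this: at each scale the pose map is concatenated to the image features before the convolutions, so the channel arithmetic is the same.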

Discriminator

  • Mirrors the generator’s structure in reverse, progressively downsampling inputs.
  • At every scale, the corresponding pose map S_n is concatenated with the image features prior to convolutions.
  • Output: a scalar critic score D(x, S) (no activation).
  • All blocks are standard Progressive GAN “ConvBlocks”; no use of conditional BatchNorm or SPADE layers.

2. Progressive Training and Structural Conditioning

The PSGAN is trained via progressive resolution doubling: starting at 4×4, then 8×8, up to 1024×1024. Each resolution stage processes 600,000 images (real plus fake), split equally between a fade-in phase (to transition to the higher resolution) and a stabilization phase.

At every resolution, the structure condition input is kept commensurate: for example, the 128×128 generator/discriminator stages are fed with S_128, created by downsampling S_1024 the requisite number of times. This parallel schedule ensures that global pose alignment is learned in low-resolution stages, while higher-resolution layers refine texture, enforcing retention of pose information throughout network depth.

Injecting structural conditions at every scale prevents “pose forgetting,” a phenomenon where the network regresses toward semantically inaccurate yet visually plausible images once fine-scale information dominates, as observed in unconditional Progressive GAN baselines. The fade-in mechanism at each scale mitigates destabilization seen when introducing new layers abruptly.
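The fade-in blend itself is a one-line linear interpolation between the upsampled coarse-path output and the new fine-path output. A minimal NumPy sketch, with hypothetical function names and nearest-neighbor upsampling as in the generator:

```python
import numpy as np

def fade_in(coarse_rgb, fine_rgb, alpha):
    """Progressive-growing fade-in: linearly blend the upsampled
    coarse-path output with the new fine-path output.
    alpha ramps from 0 to 1 over the fade-in phase."""
    up = coarse_rgb.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbor 2x
    return (1.0 - alpha) * up + alpha * fine_rgb

coarse = np.full((3, 4, 4), 0.0)  # old 4x4 output head
fine = np.full((3, 8, 8), 1.0)    # new 8x8 output head
print(fade_in(coarse, fine, 0.25).mean())  # 0.25
```

At alpha = 0 the network behaves exactly as before the new layers were added; at alpha = 1 the new layers carry the full signal.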

3. Objective Functions and Optimization

Training uses the WGAN-GP objective [Gulrajani et al., 2017] to stabilize adversarial optimization. Formally, the losses are:

Discriminator (Critic) Loss:

L_D = \mathbb{E}_{z\sim p(z),\,S}\bigl[D(G(z,S), S)\bigr] - \mathbb{E}_{(x,S)\sim p_\mathrm{data}}\bigl[D(x,S)\bigr] + \lambda\,\mathbb{E}_{\hat x \sim p_{\hat x}}\Bigl[\bigl(\lVert\nabla_{\hat x}D(\hat x, S)\rVert_{2}-1\bigr)^{2}\Bigr]

where \lambda = 10, and \hat x is interpolated between real and generated images.

Generator Loss:

L_G = -\mathbb{E}_{z\sim p(z),\,S}\bigl[D(G(z,S), S)\bigr]

No auxiliary losses (e.g., L_1, perceptual, or feature-matching) are introduced. The update ratio is n_critic = 1 (one critic update per generator step).

Optimization uses Adam with β₁ = 0, β₂ = 0.99; the learning rate is scheduled from α = 0.001 at low resolutions down to α = 0.0001 at 1024×1024. Batch sizes decrease with resolution due to memory constraints (16 at 128², 2 at 1024²).
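The two losses above can be illustrated with a toy NumPy sketch. A linear critic is used so its input gradient is available analytically; in the real model D is a convolutional network conditioned on S, and the gradient penalty is computed via autograd. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)        # toy linear critic D(x) = w . x
D = lambda x: x @ w           # its gradient w.r.t. x is w, analytically

def wgan_gp_losses(x_real, x_fake, lam=10.0):
    """WGAN-GP critic and generator losses for the toy linear critic.
    The pose condition S is omitted; it would be a second input to D."""
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1 - eps) * x_fake          # interpolates
    grad = np.broadcast_to(w, x_hat.shape)             # dD/dx_hat, per sample
    gp = lam * np.mean((np.linalg.norm(grad, axis=1) - 1.0) ** 2)
    loss_d = D(x_fake).mean() - D(x_real).mean() + gp  # critic minimizes this
    loss_g = -D(x_fake).mean()                         # generator minimizes this
    return loss_d, loss_g
```

With n_critic = 1, each training step computes loss_d on one batch, updates the critic, then computes loss_g and updates the generator.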

4. Dataset Construction and Structural Encoding

PSGAN introduces a bespoke dataset for evaluation:

Avatar Anime-Character Dataset

  • Source: 69 Unity 3D character "outfits", each animated through 600 unique poses, for a reported total of 47,400 images (the total exceeds 69 × 600 = 41,400, likely due to additional actions).
  • Resolution: all renders at 1024×1024 on a uniform white background.
  • Pose Data: exact 2D pose keypoints (M = 20 channels) derived directly from Unity rig bones, eliminating detection noise.
  • Pose Maps: for each keypoint, the pose map channel is +1 at the bone-root pixel and −1 elsewhere. Lower-resolution maps are generated on the fly via max pooling.
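The ±1 encoding and max-pool downsampling can be reproduced in a few lines of NumPy. The keypoint coordinates and channel counts below are illustrative; max pooling is used (rather than averaging) so the single +1 pixel survives downsampling instead of being diluted.

```python
import numpy as np

def make_pose_map(keypoints, size=1024, channels=20):
    """+1 at each keypoint pixel, -1 everywhere else, one channel per keypoint."""
    S = -np.ones((channels, size, size))
    for c, (x, y) in enumerate(keypoints):
        S[c, y, x] = 1.0
    return S

def downsample(S, factor):
    """Max pooling over factor x factor blocks preserves the +1 marker."""
    C, H, W = S.shape
    return S.reshape(C, H // factor, factor, W // factor, factor).max(axis=(2, 4))

S_1024 = make_pose_map([(512, 300)], channels=1)  # one illustrative keypoint
S_128 = downsample(S_1024, 8)                     # 1024 -> 128
print(S_128.shape, S_128.max())  # (1, 128, 128) 1.0
```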

Baseline: DeepFashion

  • 52,712 real images at 256×256 (In-shop Clothes Retrieval subset).
  • OpenPose-estimated 18 keypoints; images with fewer than 10 detected keypoints are excluded.
  • Pose maps use the identical +1/−1 encoding.

Preprocessing consists of background removal (for synthetic data), channel stacking, and dynamic (or cached) generation of pose maps at each required resolution.

5. Empirical Evaluation and Comparative Analysis

Structural Consistency

On DeepFashion (256×256), Progressive GAN without pose information consistently fails to generate anatomically valid full-body images, frequently misplacing limbs or amputating extremities. PSGAN, with multi-scale pose conditioning, preserves alignment and structure throughout synthesis.

Image Fidelity

Compared to PG2 (Ma et al., 2017)—a one-shot supervised pose transfer—PSGAN’s unsupervised, latent-based approach yields substantially sharper edge detail and more realistic shading, especially along clothing boundaries.

Quantitative Metrics

No FID or Inception Score values are reported; assessments are qualitative. Authors state that PSGAN achieves higher fidelity and structural consistency than baselines across tests.

Video Generation

By varying the structural condition S_t (a temporal pose sequence) while keeping z fixed, PSGAN produces temporally coherent, smooth 1024×1024 character animations.
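The animation procedure reduces to iterating the generator over the pose sequence with a fixed latent. A sketch with a stub generator (the real model operates at 1024×1024; the stub and its 64×64 output size are placeholders):

```python
import numpy as np

def generate_animation(G, z, pose_sequence):
    """Fixed latent z + per-frame pose maps S_t -> a list of frames.
    G is any callable taking (z, pose_map); a stub stands in below."""
    return [G(z, S_t) for S_t in pose_sequence]

# stub generator standing in for the trained PSGAN
G_stub = lambda z, S: np.zeros((3, 64, 64))
z = np.random.default_rng(1).normal(size=512)       # one fixed identity
poses = [np.zeros((20, 64, 64)) for _ in range(30)]  # 30-frame pose sequence
frames = generate_animation(G_stub, z, poses)
print(len(frames))  # 30
```

Because z encodes appearance and S_t encodes pose, holding z fixed yields a consistent character identity across frames.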

6. Implementation Characteristics and Design Rationale

PSGAN retains the conventional Progressive GAN convolutional/fade-in pipeline, substituting pose-map channel concatenation for more elaborate conditional fusion (e.g., conditional BatchNorm, SPADE). This leverages the spatial alignment inherent in pose maps, obviating specialized normalization or attention modules. All network blocks are as in the standard Progressive GAN.

Training is robust across resolutions using WGAN-GP and fade-in curricula. Batch size reductions at higher scales reflect practical hardware limits. The authors observe that global structural pose emerges in coarse layers, while localized features (texture, shading) are introduced progressively at subsequent higher resolutions.

7. Context, Significance, and Limitations

PSGAN addresses the persistent challenge of pose-conditioned generative modeling at high resolutions, where single-scale or late-fusion conditioning struggles to enforce global pose through deep networks. By coordinating structure at every scale, PSGAN attains strong structural consistency for full-body synthesis and animation tasks in domains such as anime character and human image generation.

While detailed quantitative benchmarking (e.g., FID/Inception) is absent, qualitative results position PSGAN as superior to both unconditional progressive training (for structural accuracy) and task-specific one-shot pipelines (for visual sharpness). A plausible implication is that the multi-scale conditioning paradigm could be further extended to other generative domains where spatial structure is critical.

The reliance on synthetic datasets with ground-truth pose simplifies conditioning but may limit direct generalization to unconstrained real-world imagery due to keypoint detection noise. The architecture also prioritizes spatial alignment over semantic controllability; expansion to richer conditional signals remains an open research direction (Hamada et al., 2018).
