Progressive Structure-Conditional GANs
- The paper introduces a novel GAN architecture that integrates multi-scale pose information at every stage to ensure consistent full-body image synthesis.
- It employs a progressive growing paradigm with a fade-in mechanism, concatenating downsampled pose maps to guide both generator and discriminator outputs.
- Empirical evaluation shows PSGAN delivers sharper detail and improved pose fidelity compared to unconditional and one-shot pose transfer methods.
Progressive Structure-Conditional Generative Adversarial Networks (PSGANs) are a specialized generative architecture developed to synthesize high-resolution, full-body, structurally consistent character images by integrating multi-scale structural (pose) information within a progressive training scheme. PSGAN advances the state-of-the-art in conditional generative modeling by enforcing spatial alignment between synthesized image content and supplied pose maps at all resolutions, targeting the joint challenges of photo-realistic detail and pose fidelity in single-image and video generation settings, as demonstrated in full-resolution anime character synthesis (Hamada et al., 2018).
1. Architectural Framework
PSGANs are architecturally grounded in the progressive growing paradigm originally introduced by Karras et al. (2018), but extend this methodology by incorporating explicit structure-conditional signals at every stage of both the generator and discriminator. At each spatial resolution, a correspondingly downsampled pose map, constructed from high-resolution keypoint maps, is concatenated along the channel axis to the network's feature maps.
Generator
- Input: latent vector z, typically 512-dimensional.
- Initial layer: a learned 4×4 "constant" block.
- Upsampling: each stage applies nearest-neighbor upsampling followed by two conv–Leaky-ReLU blocks (structure: 3×3 conv → Leaky ReLU → 3×3 conv → Leaky ReLU).
- Structure Conditioning: after every conv block, the pose map (one channel per keypoint, obtained from the full-resolution map via repeated max pooling) is concatenated to the feature tensor.
- Fade-in: transitions between resolutions use linear blending between coarse and fine paths to ensure training stability.
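The fade-in blend described above can be sketched in a few lines; this is a minimal numpy illustration (names hypothetical), where the real implementation would blend the RGB outputs of learned coarse and fine paths:

```python
import numpy as np

def upsample_nn(x, factor=2):
    """Nearest-neighbor upsampling of a (C, H, W) map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fade_in(coarse_rgb, fine_rgb, alpha):
    """Linearly blend the upsampled coarse output with the new fine path.
    alpha ramps from 0 to 1 over the fade-in phase."""
    return (1.0 - alpha) * upsample_nn(coarse_rgb) + alpha * fine_rgb

coarse = np.ones((3, 4, 4))        # output of the 4x4 stage
fine = np.full((3, 8, 8), 3.0)     # output of the new 8x8 stage
blended = fade_in(coarse, fine, alpha=0.5)
print(blended.shape)               # (3, 8, 8)
```

At alpha = 0 the network behaves exactly like the previous, lower-resolution model; at alpha = 1 the new layers take over entirely.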
Discriminator
- Mirrors the generator’s structure in reverse, progressively downsampling inputs.
- At every scale, the corresponding pose map is concatenated with the image features prior to convolutions.
- Output: a scalar critic score (no activation).
- All blocks are standard Progressive GAN “ConvBlocks”; no use of conditional BatchNorm or SPADE layers.
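In both networks, the structure conditioning reduces to a channel-axis concatenation before convolutions. A minimal sketch, assuming an 18-keypoint map already downsampled to the feature resolution (shapes illustrative):

```python
import numpy as np

def concat_pose(features, pose_map):
    """Concatenate a (K, H, W) pose map onto (C, H, W) features along
    the channel axis, as done at every scale in both networks."""
    assert features.shape[1:] == pose_map.shape[1:], "spatial sizes must match"
    return np.concatenate([features, pose_map], axis=0)

feats = np.random.randn(64, 16, 16)   # image features at a 16x16 stage
pose = -np.ones((18, 16, 16))         # 18-keypoint map at the same scale
x = concat_pose(feats, pose)
print(x.shape)                        # (82, 16, 16)
```

Because the pose map is already spatially aligned with the features, plain concatenation suffices; no conditional normalization is needed.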
2. Progressive Training and Structural Conditioning
The PSGAN is trained via progressive resolution doubling: starting at 4 × 4 and doubling up to the final output resolution (1024 × 1024 for the anime dataset). Each resolution stage proceeds through 600,000 images (real plus fake), split equally between a fade-in phase (to transition to the higher resolution) and a stabilization phase.
At every resolution, the structure condition input is kept commensurate: the generator/discriminator stages at a given resolution are fed pose maps created by downsampling the full-resolution map the requisite number of times. This parallel schedule ensures that global pose alignment is learned in low-resolution stages, while higher-resolution layers refine texture, enforcing retention of pose information throughout network depth.
Injecting structural conditions at every scale prevents “pose forgetting,” a phenomenon where the network regresses toward semantically inaccurate yet visually plausible images once fine-scale information dominates, as observed in unconditional Progressive GAN baselines. The fade-in mechanism at each scale mitigates destabilization seen when introducing new layers abruptly.
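The training curriculum above can be enumerated explicitly. The sketch below assumes the first stage needs no fade-in (there is no coarser model to blend from) and that later stages split their image budget equally, as described; names and the exact split are illustrative:

```python
def progressive_schedule(start=4, final=1024, images_per_stage=600_000):
    """Enumerate (resolution, phase, n_images) training stages.
    Each resolution after the first gets a fade-in phase and a
    stabilization phase of equal length."""
    stages, res = [], start
    while res <= final:
        if res == start:
            stages.append((res, "stabilize", images_per_stage))
        else:
            stages.append((res, "fade-in", images_per_stage // 2))
            stages.append((res, "stabilize", images_per_stage // 2))
        res *= 2
    return stages

for res, phase, n in progressive_schedule(final=32):
    print(res, phase, n)
```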
3. Objective Functions and Optimization
Training stabilizes the adversarial optimization using the WGAN-GP objective [Gulrajani et al., 2017]. The losses are formally:
Discriminator (Critic) Loss:

$$L_D = \mathbb{E}_{z \sim p(z)}\big[D(G(z, s), s)\big] - \mathbb{E}_{x \sim p_{\text{data}}}\big[D(x, s)\big] + \lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}, s) \rVert_2 - 1\big)^2\Big]$$

where $s$ is the structural (pose) condition, $\lambda$ is the gradient-penalty weight (10 in standard WGAN-GP), and $\hat{x}$ is interpolated between real and generated images.

Generator Loss:

$$L_G = -\mathbb{E}_{z \sim p(z)}\big[D(G(z, s), s)\big]$$
No auxiliary losses (e.g., L1, perceptual, or feature-matching) are introduced. The critic-to-generator update ratio is 1:1 (one critic update per generator step).
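The WGAN-GP objectives can be expressed as pure functions of per-sample critic scores. A minimal numpy sketch (in a real implementation the gradient norms on interpolated samples come from autograd):

```python
import numpy as np

def wgan_gp_losses(d_real, d_fake, grad_norms, lam=10.0):
    """WGAN-GP objectives from per-sample critic scores.
    d_real/d_fake: critic outputs on real and generated batches;
    grad_norms: gradient norms of the critic at interpolated samples."""
    gp = lam * np.mean((grad_norms - 1.0) ** 2)
    d_loss = np.mean(d_fake) - np.mean(d_real) + gp
    g_loss = -np.mean(d_fake)
    return d_loss, g_loss

d_loss, g_loss = wgan_gp_losses(
    d_real=np.array([1.0, 1.0]),
    d_fake=np.array([-1.0, -1.0]),
    grad_norms=np.array([1.0, 1.0]),   # exactly unit norm -> zero penalty
)
print(d_loss, g_loss)                  # -2.0 1.0
```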
Optimization uses Adam with the momentum settings of Progressive GAN ($\beta_1 = 0$, $\beta_2 = 0.99$); the learning rate is scheduled per resolution stage. Batch sizes decrease with resolution due to memory constraints (16 at the lowest resolutions, down to 2 at the highest).
4. Dataset Construction and Structural Encoding
PSGAN introduces a bespoke dataset for evaluation:
Avatar Anime-Character Dataset
- Source: 69 Unity 3D character "outfits"; each is animated through 600 unique poses, for a reported total of 47,400 images (the excess over 69 × 600 = 41,400 is attributed to additional actions).
- Resolution: all renders at 1024 × 1024 on a uniform white background.
- Pose Data: exact 2D pose keypoints (one channel per keypoint) derived directly from Unity rig bones, eliminating detection noise.
- Pose Maps: for each keypoint, the pose map channel is 1 at the bone-root pixel and −1 elsewhere. Lower-resolution maps are generated on-the-fly via max pooling.
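The keypoint encoding and its multi-resolution downsampling can be sketched as follows (illustrative numpy; max pooling is what guarantees the single positive keypoint marker survives each halving):

```python
import numpy as np

def make_pose_map(keypoints, size):
    """One channel per keypoint: +1 at the keypoint pixel, -1 elsewhere."""
    m = -np.ones((len(keypoints), size, size))
    for c, (y, x) in enumerate(keypoints):
        m[c, y, x] = 1.0
    return m

def downsample_pose(m, factor=2):
    """Per-channel 2x2 max pooling; the max keeps the +1 marker alive."""
    k, h, w = m.shape
    return m.reshape(k, h // factor, factor, w // factor, factor).max(axis=(2, 4))

full = make_pose_map([(5, 9)], size=16)
half = downsample_pose(full)
print(half.shape, half[0, 2, 4])   # (1, 8, 8) 1.0
```

Average pooling would instead dilute the keypoint toward the −1 background, which is why max pooling is the natural choice here.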
Baseline: DeepFashion
- 52,712 real images (In-shop Clothes Retrieval subset).
- 18 keypoints estimated with OpenPose; images with fewer than 10 detected keypoints are excluded.
- Pose maps use the identical 1/−1 encoding.
Preprocessing consists of background removal (for synthetic data), channel stacking, and dynamic (or cached) generation of pose maps at each required resolution.
5. Empirical Evaluation and Comparative Analysis
Structural Consistency
On DeepFashion, Progressive GAN without pose information consistently fails to generate anatomically valid full-body images, frequently misplacing limbs or amputating extremities. PSGAN, with multi-scale pose conditioning, preserves alignment and structure throughout synthesis.
Image Fidelity
Compared to PG2 (Ma et al., 2017)—a one-shot supervised pose transfer—PSGAN’s unsupervised, latent-based approach yields substantially sharper edge detail and more realistic shading, especially along clothing boundaries.
Quantitative Metrics
No FID or Inception Score values are reported; assessments are qualitative. The authors state that PSGAN achieves higher fidelity and structural consistency than baselines across tests.
Video Generation
By varying the structural condition (a temporal pose sequence) while keeping the latent vector fixed, PSGAN produces temporally coherent, smooth, full-resolution character animations.
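Schematically, animation is a loop over pose maps with the latent held constant. The sketch below uses a stand-in linear map in place of the trained generator (all names and shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4 + 2))    # stand-in "generator" weights

def generate_frame(z, pose):
    """Stand-in for G(z, pose): any function of a fixed latent and a
    per-frame pose vector; the trained PSGAN generator replaces this."""
    return W @ np.concatenate([z, pose])

z = rng.standard_normal(4)             # latent kept fixed across the clip
poses = [np.array([np.sin(t), np.cos(t)]) for t in np.linspace(0, 1, 24)]
clip = np.stack([generate_frame(z, p) for p in poses])
print(clip.shape)                      # (24, 8)
```

Because only the pose input changes between frames, identity and appearance (encoded in z) stay constant while the pose animates.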
6. Implementation Characteristics and Design Rationale
PSGAN retains the conventional Progressive GAN convolutional/fade-in pipeline, substituting pose-map channel concatenation for more elaborate conditional fusion (e.g., conditional BatchNorm, SPADE). This leverages the spatial alignment inherent in pose maps, obviating specialized normalization or attention modules. All network blocks are as in the standard Progressive GAN.
Training is robust across resolutions using WGAN-GP and fade-in curricula. Batch size reductions at higher scales reflect practical hardware limits. The authors observe that global structural pose emerges in coarse layers, while localized features (texture, shading) are introduced progressively at subsequent higher resolutions.
7. Context, Significance, and Limitations
PSGAN addresses the persistent challenge of pose-conditioned generative modeling at high resolutions, where single-scale or late-fusion conditioning struggles to enforce global pose through deep networks. By coordinating structure at every scale, PSGAN attains unprecedented consistency for full-body synthesis and animation tasks in domains such as anime character and human image generation.
While detailed quantitative benchmarking (e.g., FID/Inception) is absent, qualitative results position PSGAN as superior to both unconditional progressive training (for structural accuracy) and task-specific one-shot pipelines (for visual sharpness). A plausible implication is that the multi-scale conditioning paradigm could be further extended to other generative domains where spatial structure is critical.
The reliance on synthetic datasets with ground-truth pose simplifies conditioning but may limit direct generalization to unconstrained real-world imagery due to keypoint detection noise. The architecture also prioritizes spatial alignment over semantic controllability; expansion to richer conditional signals remains an open research direction (Hamada et al., 2018).