SofGAN: Controllable Portrait Generation
- SofGAN is a GAN-based portrait generator leveraging dual latent spaces to decouple and precisely control facial 3D geometry and texture.
- It employs a Semantic Occupancy Field (SOF) for explicit 3D geometric segmentation and a Semantic Instance-Wise (SIW) module for region-specific stylization.
- The architecture delivers high-quality, photorealistic outputs with low FID and improved LPIPS, supporting facial animation, interactive editing, and cross-domain generalization.
SofGAN is a portrait image generator based on generative adversarial networks (GANs) that achieves disentangled and dynamic control of facial geometric and textural attributes. The model introduces a dual-latent architecture and a Semantic Occupancy Field (SOF) for explicit 3D geometry representation, combined with a Semantic Instance-Wise (SIW) texturing approach for region-specific stylization. SofGAN enables the synthesis of high-quality, photorealistic, and diverse portraits with precise and independent manipulation of pose, shape, and texture characteristics (Chen et al., 2020).
1. Architectural Overview
SofGAN’s generator is partitioned into two coordinated yet independently controllable branches: one for geometry and one for texture. The latent space is split accordingly:
- zᵍ (geometry latent code): Encodes 3D spatial structure and part semantics of the portrait.
- zᵗ (texture latent code): Encodes appearance details, including color, fine-grained style, and texture.
The geometry branch employs a hyper-network that maps zᵍ to the weights of a Multilayer Perceptron (MLP) representing the Semantic Occupancy Field (SOF). The SOF encodes a volumetric, canonical-pose geometry together with dense semantic labeling.
The texture branch processes zᵗ through a basis transformation and the SIW module (an extension of StyleGAN2), which modulates textures region-specifically using the segmentation maps rendered from the SOF.
The complete image synthesis operation is expressed as:

I = G(M(zᵗ), R(SOF(zᵍ), π))

where:
- G is the SIW generator,
- M is a learned mapping for the texture code,
- SOF(zᵍ) constructs semantic 3D geometry from the geometry latent,
- R is a differentiable renderer producing 2D segmentation maps under camera pose π.
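The two-branch pipeline above can be sketched end to end with toy stand-ins. Everything below (the mapping network, the renderer, and the per-region generator) is a hypothetical minimal mock, assuming only the compositional structure I = G(M(zᵗ), R(SOF(zᵍ), π)); none of the function bodies reflect the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def mapping(z_t, W):
    """Hypothetical mapping network M: texture code -> flat style vector."""
    return np.tanh(W @ z_t)

def render_segmentation(z_g, pose, n_classes=4, res=8):
    """Stand-in for R(SOF(z_g), pose): returns a (res, res) semantic label
    map that depends deterministically on geometry code and camera pose."""
    seed = int((np.sum(np.abs(z_g)) * 1000 + pose * 100) % (2**32))
    return np.random.default_rng(seed).integers(0, n_classes, size=(res, res))

def siw_generator(style, seg):
    """Stand-in for the SIW generator G: paints each semantic region with
    its own RGB style slice (region-wise stylization in miniature)."""
    img = np.zeros(seg.shape + (3,))
    for k in range(style.shape[0]):
        img[seg == k] = style[k]
    return img

z_g = rng.normal(size=16)                 # geometry latent z^g
z_t = rng.normal(size=16)                 # texture latent z^t
W = rng.normal(size=(4 * 3, 16))
style = mapping(z_t, W).reshape(4, 3)     # one RGB style per semantic class
seg = render_segmentation(z_g, pose=0.3)  # R(SOF(z^g), pose)
image = siw_generator(style, seg)         # I = G(M(z^t), R(SOF(z^g), pose))
```

Note how swapping zᵗ changes only `style` (appearance) while `seg` is untouched, mirroring the geometry–texture disentanglement.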
2. Semantic Occupancy Field (SOF)
The SOF is a neural implicit field that, for each 3D point x ∈ ℝ³, predicts a k-dimensional probability vector of semantic class membership, providing a dense and continuous geometric representation associated with semantic segmentation. The network is structured as:
- f: ℝ³ → ℝᵈ (geometry features, via MLP)
- g: ℝᵈ → ℝᵏ (semantic logits, followed by softmax activation)
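The f/g decomposition above can be illustrated with a tiny randomly initialized point-query network. This is a sketch of the interface only (3D points in, per-point class probabilities out); the layer widths, depth, and weights are placeholder assumptions, not the paper's architecture.

```python
import numpy as np

def sof_query(points, W1, b1, W2, b2):
    """Toy SOF query: f maps 3D points to features (ReLU MLP), g maps
    features to k semantic logits, and a softmax yields probabilities."""
    feats = np.maximum(points @ W1 + b1, 0.0)          # f: R^3 -> R^d
    logits = feats @ W2 + b2                           # g: R^d -> R^k
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)           # softmax over classes

rng = np.random.default_rng(1)
d, k = 32, 5
W1, b1 = rng.normal(size=(3, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, k)), np.zeros(k)
pts = rng.normal(size=(10, 3))                         # 10 query points
probs = sof_query(pts, W1, b1, W2, b2)                 # (10, k) probabilities
```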
Rendering 2D segmentation maps involves casting rays through the SOF, employing a differentiable ray-marching procedure that, for each pixel:
- Samples points on the ray,
- Accumulates occupancy probabilities to determine the first ‘hit’ on the surface,
- Reads the semantic class probabilities at the surface location.
This mechanism yields segmentation maps that are consistent across arbitrary viewpoints, facilitating free-viewpoint synthesis and editing.
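The three per-pixel steps above (sample, find the first hit, read semantics) can be sketched with a simple surface-crossing rule. The threshold test and the sphere-shaped toy field are illustrative assumptions; the paper's renderer is differentiable and volumetric, which this discrete version only approximates.

```python
import numpy as np

def march_ray(origin, direction, sof, n_samples=64, t_near=0.0, t_far=2.0,
              thresh=0.5):
    """Sketch of ray marching through a semantic occupancy field: sample
    points along the ray, take the first sample whose occupancy
    (1 - P[background]) crosses the threshold, and return that sample's
    class probabilities. Class 0 is background by assumption here."""
    ts = np.linspace(t_near, t_far, n_samples)
    pts = origin + ts[:, None] * direction          # (n_samples, 3)
    probs = sof(pts)                                # (n_samples, k)
    occ = 1.0 - probs[:, 0]                         # occupancy per sample
    hits = np.nonzero(occ > thresh)[0]
    if len(hits) == 0:
        return None                                 # ray misses the surface
    return probs[hits[0]]                           # semantics at first hit

def toy_sof(pts):
    """Toy field: a sphere of radius 0.5 labeled class 1, else background."""
    inside = (np.linalg.norm(pts, axis=-1) < 0.5).astype(float)
    probs = np.zeros((len(pts), 3))
    probs[:, 0] = 1.0 - inside                      # background
    probs[:, 1] = inside                            # sphere class
    return probs

hit = march_ray(np.array([0.0, 0.0, -1.0]), np.array([0.0, 0.0, 1.0]), toy_sof)
miss = march_ray(np.array([2.0, 0.0, -1.0]), np.array([0.0, 0.0, 1.0]), toy_sof)
```

Casting one such ray per pixel produces the 2D segmentation map consumed by the texture branch.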
3. Semantic Instance-Wise (SIW) Module
SIW enables flexible, region-level style modulation by integrating semantic segmentation into the StyleGAN2 architectural paradigm. The module operates as follows:
- For each semantic region k (k = 1, …, K), a distinct per-region style vector wₖ is derived from the texture latent code zᵗ.
- Convolutions in the generator are performed separately per region:

  F = Σₖ mₖ ⊙ conv(x; θₖ)

  with θₖ = θ · sₖ, where sₖ is a modulation computed from wₖ, and mₖ is the one-hot mask for region k.
A mixed-style training scheme blends two style codes w₁ and w₂ using a semantic-aware distance map d:

w = d ⊙ w₁ + (1 − d) ⊙ w₂

where the spatially varying weights stem from SPADE-style normalization layers, enforcing spatial adaptivity. This facilitates smooth transitions at semantic boundaries and precise region-specific control.
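The two SIW operations can be illustrated in a few lines. The sketch below replaces the paper's modulated convolutions with a plain per-region channel scaling and treats the distance map as a given array in [0, 1]; both simplifications are assumptions made to keep the example self-contained.

```python
import numpy as np

def siw_modulate(features, seg, styles):
    """Sketch of region-wise modulation: masked sum over regions,
    F = sum_k m_k * (features * s_k), standing in for per-region
    modulated convolutions."""
    out = np.zeros_like(features)
    for k in range(styles.shape[0]):
        mask = (seg == k)[..., None]          # one-hot mask m_k
        out += mask * (features * styles[k])  # region-specific modulation
    return out

def blend_styles(styles_a, styles_b, dist_map, seg):
    """Mixed-style blending: per pixel, interpolate the two per-region
    styles with a distance map d in [0, 1] (toy version of
    w = d*w1 + (1-d)*w2)."""
    sa = styles_a[seg]                        # (H, W, C) style A per pixel
    sb = styles_b[seg]
    return dist_map[..., None] * sa + (1.0 - dist_map[..., None]) * sb

rng = np.random.default_rng(2)
H, W, C, K = 4, 4, 3, 2
features = rng.normal(size=(H, W, C))
seg = rng.integers(0, K, size=(H, W))         # semantic label map
styles = rng.normal(size=(K, C))              # one style vector per region
mod = siw_modulate(features, seg, styles)

d = np.ones((H, W))                           # d = 1 everywhere -> pure A
sa, sb = rng.normal(size=(K, C)), rng.normal(size=(K, C))
blended = blend_styles(sa, sb, d, seg)
```

With a smoothly varying `d`, the blend interpolates styles gradually across semantic boundaries rather than switching abruptly at mask edges.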
4. Disentanglement, Control, and Applications
The explicit geometry–texture decoupling in SofGAN enables a spectrum of controllable image synthesis tasks:
- Facial Animation and Free-Viewpoint Rendering: 3D-aware geometry allows pose and expression changes without affecting region-wise textural appearance. Identity-consistent rendering is maintained over arbitrary camera views.
- Dynamic Regional Styling: The SIW module supports independent manipulation of semantic regions such as hair, eyes, or clothing, enabling interactive style editing, style mixing, and appearance morphing.
- Interactive and Incomplete Input Generation: Robustness to incomplete or hand-drawn segmentation maps enables creative and user-driven applications in portrait composition or special effects.
- Cross-Domain Generalization: By disentangling 3D geometry (trained on aligned scans) and texture (from unpaired natural images), SofGAN generalizes well to new datasets (e.g., FFHQ, CelebAMask-HQ), and tasks such as age/gender morphing or facial reenactment.
5. Mathematical Formulations
Key formulations describing SofGAN’s operations are as follows:
| Component | Formula/Mapping | Description |
|---|---|---|
| Image Generation | I = G(M(zᵗ), R(SOF(zᵍ), π)) | Outputs photo given texture and geometry |
| SOF Mapping | x ∈ ℝ³ ↦ softmax(g(f(x))) ∈ ℝᵏ | 3D point to semantic probabilities |
| Ray-Marched Surface | x(t) = o + t·r | Points along camera ray for SOF sampling |
| SIW Modulation | F = Σₖ mₖ ⊙ conv(x; θₖ) | Region-specific style convolution |
| Mixed-Style Blending | w = d ⊙ w₁ + (1 − d) ⊙ w₂ | Blending different styles via region map |
These formulations underpin the disentanglement and realism achieved by SofGAN.
6. Experimental Evaluation and Performance
SofGAN has undergone extensive quantitative and qualitative evaluation:
- Quality (FID, LPIPS): Achieves lower Fréchet Inception Distance and improved LPIPS over baselines such as StyleGAN2, SPADE, SEAN, and Pix2PixHD, confirming both perceptual realism and diversity.
- Geometric Fidelity (mIoU): SOF-generated segmentation maps achieve high mean Intersection-over-Union scores across novel views, indicating robust, ground-truth-consistent geometry encoding.
- Region-Level Control: Ablation demonstrates that SIW and mixed-style training reduce artifacts at semantic borders and provide improved region consistency.
- Free-Viewpoint Consistency: Changing the input camera direction results in synthesized images that preserve identity and style, demonstrating 3D semantic awareness.
- Manipulation and Editing: Regional manipulations (e.g., aging, expression changes) are cleanly localized, with other attributes unaffected.
7. Significance and Implications
SofGAN represents an advance in controllable portrait image generation, specifically addressing the entanglement problem in GAN-based facial synthesis. By leveraging an explicit 3D semantic occupancy field and regionally adaptive stylization, SofGAN:
- Enables explicit, interpretable, and interactive control over both geometry and texture.
- Provides a unified platform for photo-realistic facial editing, animation, and novel view synthesis.
- Sets a new benchmark in terms of both image quality and attribute disentanglement, as reflected by standardized metrics and user-driven manipulation capabilities.
This architectural paradigm demonstrates the efficacy of integrating geometry-aware volumetric representations with GAN-based region-conditioned stylization, contributing to controllable, high-fidelity generative modeling of human portraits (Chen et al., 2020).