
Aggregated Contextual-Transformation GAN

Updated 10 December 2025
  • The paper's core contribution is the multi-scale AOT block, a split-transform-aggregate module with gated residual connections for effective context integration.
  • The model achieves significant empirical gains, including a 38.6% reduction in FID and marked improvements in PSNR and SSIM for image inpainting, outperforming methods like GatedConv.
  • The architecture flexibly combines attention mechanisms, diverse loss functions, and tailored generator-discriminator designs to deliver robust performance in both visual reconstruction and radio map synthesis.

Aggregated Contextual-Transformation GAN (AOT-GAN) refers to a class of generative adversarial network (GAN) architectures featuring the Aggregated Contextual-Transformation (AOT) block, a split-transform-aggregate module that enhances multi-scale context reasoning and texture synthesis. Originally introduced for high-resolution image inpainting, the AOT-GAN paradigm has also been extended to related tasks in spatial map reconstruction, demonstrating strong empirical performance in both visual and signal domains (Zeng et al., 2021, Qi et al., 2024).

1. Core AOT Block: Multi-Scale Contextual Transformation

The AOT block is central to all AOT-GAN variants. It is designed to aggregate contextual cues from multiple receptive fields in a parameter-efficient manner.

  • Splitting: The input tensor $X \in \mathbb{R}^{H \times W \times C}$ is divided into $B$ channel-wise branches (e.g., $B=4$, giving $C/B$ channels per branch).
  • Contextual Transformation: Each branch applies a 3×3 convolution with a distinct dilation rate $r_b$ (e.g., $r_b \in \{1,2,4,8\}$), capturing both local and long-range features.
  • Merging: The branch outputs are concatenated (yielding a tensor in $\mathbb{R}^{H \times W \times C}$) and fused with a 1×1 or 3×3 convolution.
  • Gated Residual Connection: A learned spatial gate $g = \mathrm{sigmoid}(\mathrm{Conv}_{3\times3}(X))$ blends the aggregated features and the identity path:

$$\mathrm{AOT}(X) = g \odot Z + (1-g) \odot X$$

where $Z$ is the aggregated (transformed) feature and $\odot$ denotes the elementwise product (Zeng et al., 2021, Qi et al., 2024).

This design lets the network adaptively blend global and local context at each spatial location and channel, without requiring normalization inside the block.
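The split-transform-aggregate procedure above can be sketched in PyTorch. This is a minimal sketch, not the authors' implementation: the branch activation, fusion kernel, and channel widths are assumptions where the text leaves them open.

```python
import torch
import torch.nn as nn

class AOTBlock(nn.Module):
    def __init__(self, channels: int = 64, dilations=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = channels // len(dilations)
        # Split-transform: one dilated 3x3 conv per branch (ReLU is an assumption).
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, branch_ch, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        # Merge: fuse the concatenated branches back to `channels`.
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)
        # Spatial gate g = sigmoid(Conv3x3(X)).
        self.gate = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        z = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        g = torch.sigmoid(self.gate(x))
        # Gated residual: AOT(X) = g * Z + (1 - g) * X
        return g * z + (1 - g) * x

x = torch.randn(1, 64, 32, 32)
y = AOTBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32]) -- shape is preserved
```

Because the block preserves the input shape, any number of AOT blocks can be stacked back-to-back, which is how the inpainting backbone below uses them.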

2. Generator and Discriminator Architectures

The specific architecture depends on the application domain but maintains consistent use of the AOT backbone.

Image Inpainting (Zeng et al., 2021):

  • Encoder: Three 3×3 conv layers (the latter two with stride 2), reducing a 512×512 input to 128×128 feature maps.
  • AOT Backbone: A stack of eight AOT blocks with no normalization layers.
  • Decoder: Mirrored transposed conv layers (stride 2) upsample back to the original resolution, finalized by a 3×3 conv for RGB reconstruction.
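The encoder/backbone/decoder pipeline can be sketched end to end. Everything not stated above is an assumption of this sketch: the 4-channel input (masked RGB + mask), channel widths, kernel sizes, the Tanh output, and the compact stand-in AOT block.

```python
import torch
import torch.nn as nn

class AOTBlock(nn.Module):
    """Compact stand-in for the AOT block (dilated branches + gated residual)."""
    def __init__(self, ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch // len(dilations), 3, padding=d, dilation=d)
            for d in dilations)
        self.fuse = nn.Conv2d(ch, ch, 3, padding=1)
        self.gate = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        z = self.fuse(torch.cat([torch.relu(b(x)) for b in self.branches], 1))
        g = torch.sigmoid(self.gate(x))
        return g * z + (1 - g) * x

class InpaintGenerator(nn.Module):
    def __init__(self, ch=64, n_blocks=8):
        super().__init__()
        # Encoder: first conv keeps resolution, two stride-2 convs downsample 4x.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, ch, 3, stride=1, padding=1), nn.ReLU(True),
            nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.ReLU(True),
        )
        # Backbone: eight AOT blocks, no normalization layers.
        self.backbone = nn.Sequential(*[AOTBlock(4 * ch) for _ in range(n_blocks)])
        # Decoder: stride-2 transposed convs back to input size, final 3x3 conv.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4 * ch, 2 * ch, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),  # output range is an assumption
        )

    def forward(self, masked_image, mask):
        x = torch.cat([masked_image, mask], dim=1)
        return self.decoder(self.backbone(self.encoder(x)))

img = torch.randn(1, 3, 64, 64)    # small size for a quick smoke test
mask = torch.zeros(1, 1, 64, 64)
out = InpaintGenerator()(img * (1 - mask), mask)
print(out.shape)  # torch.Size([1, 3, 64, 64])
```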

Radio Map Reconstruction (Qi et al., 2024):

  • U-Net Style Generator: Seven encoder and seven decoder stages; each encoder stage comprises a convolution, batch normalization, ReLU, a CBAM attention module, and an AOT block, while each decoder stage uses a transposed convolution, batch normalization, and channel-wise concatenation with the corresponding skip connection.
  • Output: A final 1×1 conv produces a single-channel path-loss estimate.
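One encoder stage of this generator might be sketched as follows. The CBAM here is a minimal textbook version, and the AOT block is stubbed with a single gated dilated-conv residual; channel counts, reduction ratio, and the stride-2 downsampling are all assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM: channel attention (shared MLP over avg- and max-pooled
    features), then spatial attention (7x7 conv over channel statistics)."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True))
                           + self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca
        stats = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial(stats))

class EncoderStage(nn.Module):
    """Conv -> BatchNorm -> ReLU -> CBAM -> AOT-style gated dilated residual."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.cbam = CBAM(out_ch)
        self.dilated = nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2)
        self.gate = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        x = self.cbam(self.conv(x))
        g = torch.sigmoid(self.gate(x))
        return g * self.dilated(x) + (1 - g) * x

stage = EncoderStage(2, 32)            # e.g., building map + sparse observations
out = stage(torch.randn(1, 2, 64, 64))
print(out.shape)  # torch.Size([1, 32, 32, 32])
```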

Discriminators:

  • Both use PatchGAN-style fully convolutional architectures: the inpainting model uses an SM-PatchGAN variant that outputs a soft label map guided by the inpainting mask, while the radio-map model adopts a standard 70×70 PatchGAN.
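A PatchGAN-style critic can be sketched as below. The layer widths, spectral normalization, and the resulting score-map size are assumptions of this sketch rather than either paper's exact configuration; in the SM-PatchGAN variant, the mask-derived soft labels would replace plain all-ones/all-zeros targets during training.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional critic: stacked stride-2 convs with LeakyReLU,
    ending in a 1-channel map of per-patch real/fake scores (no sigmoid;
    a least-squares or BCE-with-logits loss is applied to raw scores)."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        sn = nn.utils.spectral_norm  # stabilization choice is an assumption
        self.net = nn.Sequential(
            sn(nn.Conv2d(in_ch, ch, 4, 2, 1)), nn.LeakyReLU(0.2, True),
            sn(nn.Conv2d(ch, 2 * ch, 4, 2, 1)), nn.LeakyReLU(0.2, True),
            sn(nn.Conv2d(2 * ch, 4 * ch, 4, 2, 1)), nn.LeakyReLU(0.2, True),
            sn(nn.Conv2d(4 * ch, 8 * ch, 4, 1, 1)), nn.LeakyReLU(0.2, True),
            nn.Conv2d(8 * ch, 1, 4, 1, 1),
        )

    def forward(self, x):
        return self.net(x)

scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 30, 30]) -- one score per patch
```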

3. Training Objectives and Loss Functions

The generator is trained to produce outputs that are visually coherent (for images) or accurate and sharp (for radio maps), leveraging a mixture of adversarial and perceptual criteria.

Loss Type | Image Inpainting (Zeng et al., 2021) | Radio Map (Qi et al., 2024)
Reconstruction | $L_{\rm rec} = \|x - G(x \odot (1-m),\, m)\|_1$ | $\mathcal{L}_{\rm MSE} = \mathbb{E}[\|y - \hat{y}\|_2^2]$
Perceptual | $\sum_i (1/N_i)\|\phi_i(x) - \phi_i(z)\|_1$ | $\sum_i (1/N_i)\|\Phi_i(y) - \Phi_i(\hat{y})\|_1$
Style | $\sum_i \|G_i(x) - G_i(z)\|_1$ | $\sum_i \|G(\Phi_i(y)) - G(\Phi_i(\hat{y}))\|_1$
Adversarial | Least-squares loss with mask-weighted soft targets | Standard GAN (cross-entropy) loss

Here $z$ denotes the inpainted output, $\phi_i$/$\Phi_i$ are VGG-19 feature maps, $N_i$ is the number of elements in layer $i$, and $G_i(\cdot)$/$G(\cdot)$ denote Gram matrices of those features.

Loss weights: $\lambda_{\rm adv}=0.01$, $\lambda_{\rm rec}=1$, $\lambda_{\rm per}=0.1$, $\lambda_{\rm sty}=250$ in both frameworks.
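Combining the four terms with the stated weights might look as follows. This is a hedged sketch: `vgg_feats` is a dummy stand-in (simple pooling) for real VGG-19 activations so the example runs self-contained, and the adversarial term is the least-squares form against all-ones targets rather than the mask-weighted soft targets.

```python
import torch
import torch.nn.functional as F

def vgg_feats(x):
    # Placeholder for intermediate VGG-19 activations (assumption: in a real
    # setup these come from a frozen pretrained VGG-19).
    return [F.avg_pool2d(x, 2 ** i) for i in range(1, 4)]

def gram(f):
    # Gram matrix of a feature map, normalized by its size.
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(pred, target, d_scores,
                   w_adv=0.01, w_rec=1.0, w_per=0.1, w_sty=250.0):
    rec = F.l1_loss(pred, target)                                   # L1 reconstruction
    fp, ft = vgg_feats(pred), vgg_feats(target)
    per = sum(F.l1_loss(a, b) for a, b in zip(fp, ft))              # perceptual
    sty = sum(F.l1_loss(gram(a), gram(b)) for a, b in zip(fp, ft))  # style
    adv = F.mse_loss(d_scores, torch.ones_like(d_scores))           # least-squares GAN
    return w_rec * rec + w_per * per + w_sty * sty + w_adv * adv

pred, target = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
loss = generator_loss(pred, target, d_scores=torch.rand(1, 1, 8, 8))
print(float(loss) > 0)  # True
```

Note how the large style weight ($\lambda_{\rm sty}=250$) compensates for the small magnitude of normalized Gram-matrix differences.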

4. Implementation Details

Datasets and Masks:

  • Image Inpainting: Places2 (1.8M images), CelebA-HQ, QMUL-OpenLogo, with free-form masks and/or bounding boxes.
  • Radio Map: RadioMapSeer dataset (56,000 samples, urban layouts and transmitter positions).

Optimization:

  • Adam optimizer with initial learning rate $1\times10^{-4}$, $\beta_1=0$, $\beta_2=0.9$ (image model); the radio-map model additionally decays the learning rate by a factor of $10$ after 50 epochs.
  • Pre-trained VGG19 is used for perceptual and style losses in both models.

Training Regime:

  • Image inpainting: Batch size $8$, $\approx 100$k iterations.
  • Radio map: Batch size $16$, 100 epochs.
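The stated optimization setup maps directly onto standard PyTorch components; the placeholder model and the `StepLR` scheduler choice are assumptions of this sketch.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder for G or D
# Adam with lr 1e-4 and betas (0, 0.9), as stated above.
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.0, 0.9))
# Radio-map variant: decay lr by 10x after 50 epochs.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)

for epoch in range(100):                      # 100 epochs per the text
    opt.step()                                # stands in for one epoch of updates
    sched.step()

print(opt.param_groups[0]["lr"])              # decayed twice over 100 epochs
```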

5. Empirical Results and Comparative Performance

Image inpainting (Zeng et al., 2021):

  • On Places2, for hole mask ratios of 50–60% (the challenging regime), AOT-GAN achieves FID $20.20$ ($-38.6\%$ relative to GatedConv's $32.90$), PSNR $19.01$ ($+2.07$ dB), and SSIM $0.682$ ($+9\%$ relative).
  • User study: AOT-GAN is preferred over PConv in $95.7\%$ (CelebA-HQ) and $97.7\%$ (Places2) of cases, and over GatedConv in $86.6\%$/$95.1\%$. It is even chosen over real images $17.4$–$22.5\%$ of the time.
  • Ablations confirm the importance of the multi-branch AOT construction and SM-PatchGAN for the best FID ($7.37$ on CelebA-HQ).

Radio map construction (Qi et al., 2024):

  • Dense observations: $14.6\%$ RMSE improvement over RadioUNet at sampling threshold $\tau=0.2$.
  • Sparse sampling: Average RMSE reduction of $13.2\%$ compared to RadioUNet across sampling regimes.
  • Unknown transmitter location: With $Q=150$ samples, ACT-GAN achieves RMSE $0.0106$ (vs. $0.0119$ for RadioUNet) and $0.88$ m mean error in transmitter localization (a $34.3\%$ reduction).
  • Enhanced robustness to measurement noise and sharper reconstruction of multipath effects and building edges.

6. Architectural Impact and Extensions

The AOT block's split-transform-aggregate-residual design systematically enhances both global reasoning and fine-grained synthesis. In image inpainting, this mechanism is critical for reconstructing large, irregularly-shaped missing regions while preserving complex structures and textures (Zeng et al., 2021). In radio-mapping, the same module enables accurate and physically plausible field reconstructions, crucial for urban digital twin applications (Qi et al., 2024).

The CBAM attention mechanism further focuses model capacity on informative regions in radio map synthesis, while T-Convs ensure texture fidelity during upsampling. The generalization of AOT-GAN from vision to signal domain underscores the flexibility and power of aggregated contextual transformations for a wide class of conditional generation problems.

7. Applications and Evaluation Domains

  • Visual Inpainting: Completes missing regions in complex natural images, enabling logo removal, facial attribute editing, and arbitrary object removal on high-resolution data.
  • Radio Map Construction: Efficiently infers electromagnetic field distributions in urban layouts, suitable for network planning, coverage analysis, and source localization.
  • Ablation and User Study: Rigorous quantitative and perceptual evaluation protocols show consistent improvement across SOTA competitors in both vision and spatial signal contexts.

In summary, Aggregated Contextual-Transformation GANs leverage the AOT block to deliver advanced multi-scale, context-aware generative modeling, providing a state-of-the-art solution for both high-resolution image inpainting and radio map synthesis (Zeng et al., 2021, Qi et al., 2024).
