Geometric Affine Modulation Module (GAM)

Updated 23 November 2025

Geometric Affine Modulation Module (GAM) is a neural network component that applies learned affine transformations to balance modality-specific information with a unified feature space.
It integrates channel-wise affine modulation with spatially localized affine deformation to enhance performance in multi-modality image fusion and generative models.
Empirical evidence shows GAM improves fusion quality metrics and generation fidelity by re-injecting global geometric cues and modality statistics into the feature maps.

The Geometric Affine Modulation Module (GAM) is a neural network component designed to preserve modality-specific discriminability while maintaining a unified feature space in multi-modality image fusion, and more broadly, to enable explicit geometric control over intermediate feature representations via learned affine transformations. The GAM concept encompasses channel-space affine modulation methodologies as employed in multi-modal fusion architectures, as well as spatially local affine deformation as realized via modulated transformation in high-diversity generative settings.

1. Motivations and Conceptual Overview

The core challenge in unified multi-modality architectures is balancing generalization across varied modalities with the retention of unique, modality-specific signatures. In unified image fusion pipelines, standard dual-branch encoders tend to overfit to their input modality, leading to reduced efficacy on novel modality pairs. The GAM is introduced to inject global “geometric” cues—captured as compact global statistics from the input—back into the preliminary fused features generated by the encoder. This operation is realized as a channel-wise affine transformation, re-coloring these features to retain essential information unique to each modality while keeping the main information content in a space amenable to cross-modality fusion (Li et al., 16 Nov 2025).

In the context of image generation, the analogous mechanism is the Modulated Transformation Module (MTM), which extends this concept from global channel-wise affine modulation to dense, pixel-wise affine warping of sampling grids. Here, modulation predicts per-location offsets for convolutional kernels, adding a degree of freedom to geometry beyond instance-level channel statistics (Yang et al., 2023).

2. Architectural Elements

Channel-wise Geometric Affine Modulation

In unified image fusion (specifically, the UP-Fusion architecture), GAM is inserted after the Semantic-Aware Channel Pruning Module (SCPM) and acts as follows for each modality $M$ :

The initial feature map produced by the first convolution ( $F_{\text{init}}^M \in \mathbb{R}^{C \times H \times W}$ ) is globally average pooled:

$F_G^M = \frac{1}{HW} \sum_{x=1}^H \sum_{y=1}^W F_{\text{init}}^M(x, y)$

$F_G^M$ is processed by two consecutive $1 \times 1$ Conv layers (with ReLU in-between, no normalization), producing vectors $\gamma^M, \beta^M \in \mathbb{R}^C$ :

$[\gamma^M, \beta^M] = \mathrm{Conv}_{1\times 1}\bigl(\mathrm{ReLU}(\mathrm{Conv}_{1\times 1}(F_G^M))\bigr)$

The affine modulation is applied channel-wise to the pruned-and-restored feature $Fuse^M$ :

$F_O^M(x, y, c) = Fuse^M(x, y, c) \cdot [1 + \gamma^M(c)] + \beta^M(c)$

or equivalently,

$F_O^M = Fuse^M \odot (1 + \gamma^M) + \beta^M$

where $\odot$ denotes Hadamard product broadcast over spatial dimensions.

Geometric Modulation in Generative Models: Modulated Transformation Module

For generative adversarial networks (GANs), the MTM predicts a dense field of 2D offsets for each convolutional kernel position, based on the input latent vector and the feature map at that layer:

Given $z \sim \mathcal{N}(0, I)$ and style code $y = \mathrm{MLP}(z)$ , a modulated convolution predicts offsets for each position and each sampling location $k$ :

$\Delta P_\ell = f_\theta(X_\ell, y_\ell),$

where $X_\ell \in \mathbb{R}^{C\times H \times W}$ , $\Delta P_\ell \in \mathbb{R}^{2K \times H \times W}$ .

The feature map for the next layer samples from spatially deformed positions:

$X_{\ell+1}^{c_{\text{out}}}(p_0) = \sum_{c_\text{in}=1}^{C_\text{in}} \sum_{k=1}^{K} w_\ell[c_{\text{out}}, c_{\text{in}}, k] \cdot \text{Interp} \left( X_\ell^{c_{\text{in}}}, p_0 + p_k + \Delta p_{\ell, k}(p_0) \right)$

This dynamic spatial manipulation can be interpreted as a local, per-instance, content-aware affine warp of the convolutional receptive field, generalizing classical global spatial transforms (Yang et al., 2023).

A critical function of the channel-wise GAM is to preserve information unique to each modality by re-injecting the original modality’s geometric and intensity statistics via learned $\gamma^M$ and $\beta^M$ . Since these parameters originate from modality-specific input statistics, their application “re-colors” the shared feature space such that each encoding path retains hints of its input domain. This enables the main information content to be shared, facilitating cross-modality communication via later modules (e.g., the Text-Guided Channel Perturbation Module), while still discouraging the collapse of the encoder into fully modality-specific spaces (Li et al., 16 Nov 2025).

In the generative context, local affine grid modulation (via the MTM) introduces sample-level geometry variation, such as pose or articulation, which is not possible with pure channel-wise affine modulation. This control of the sampling grid empowers the generator to model more complex spatial deformations (Yang et al., 2023).

4. Implementation Characteristics and Integration

Multi-Modality Fusion (UP-Fusion)

Each modality's GAM uses two $1 \times 1$ Conv layers (input/output: $C$ channels), separated by ReLU, with no batch or layer normalization.
Global average pooling for $F_G^M$ has no parameters.
The parameter overhead of GAM is $\sim 2C^2$ , which is negligible relative to Transformer encoder complexity.
GAM parameters are optimized jointly with the network using Adam (initial learning rate $10^{-4}$ , cosine annealed to $10^{-5}$ over 100 epochs) under the composite loss $L_\text{total} = L_\text{grad} + L_{l_1}$ .
There is no auxiliary loss or adversarial loss specific to GAM; updates are driven by the main fusion objectives (Li et al., 16 Nov 2025).

Generative Modeling (MTM)

The per-layer offset field is predicted by a single style-modulated $3\times3$ convolution, with $2K$ output channels (where $K$ is the number of sampling points per location, e.g., $K=9$ for a $3\times3$ kernel).
Offsets are directly applied to deform each convolutional kernel's receptive field for every spatial location.
The MTM is inserted only in low-resolution generator blocks to manage computational cost, yielding most performance gains in 2D and 3D-aware generation without major increases in memory or training time.
Training hyperparameters and stability tricks follow those of StyleGAN2, including demodulation and the $R_1$ penalty (Yang et al., 2023).

5. Quantitative Impact and Empirical Evidence

Ablation studies in multi-modality fusion establish that removing the channel-wise GAM from the UP-Fusion pipeline leads to measurable drops in critical quality metrics on the Harvard medical fusion dataset:

	Full Model	w/o GAM
$Q_{NCIE}$	0.8074	0.8071
$Q_P$	0.5665	0.5488
VIF	0.3190	0.3151
SSIM	0.3639	0.3478

Qualitative analysis indicates that, without GAM, the fusion output may show modality-specific artifacts (e.g., SPECT background incorrectly fused into MRI), while the inclusion of GAM corrects these failures and supports modality-appropriate detail retention (Li et al., 16 Nov 2025).

For MTM in StyleGAN-based generative models, inserting the module results in consistent improvements in FID and FVD scores on datasets with substantial pose or structural diversity. For example, in TaiChi-256 generation, StyleGAN3’s FID improves from 21.36 to 13.60 when augmented with MTM; similar performance gains are evident across 2D, 3D-aware, and video synthesis benchmarks (Yang et al., 2023).

Standard style modulation (e.g., AdaIN, StyleGAN’s channel scaling and shifting) applies instance-wise affine transforms in channel space, sculpting global texture or appearance. However, such modulation cannot alter the spatial geometry of features. The geometric affine modulation paradigm, as instantiated in both GAM and MTM, extends modulation by enabling explicit geometric control: in the former case, as global “re-biasing” towards modality-specific statistics, and in the latter as dense, location-dependent affine warping of convolutional input grids. This suggests that geometric affine modulation bridges the gap between content-awareness and domain control, enabling both discriminability preservation and spatial articulation (Li et al., 16 Nov 2025, Yang et al., 2023).

7. Significance and Broader Implications

The Geometric Affine Modulation framework underpins a sub-class of plug-and-play modules that enrich representation learning by balancing global domain bias and flexible spatial expressiveness. In multi-modality fusion, GAM addresses the tension between generalizable encodings and the retention of modality-unique cues, directly improving generalization to unseen modality combinations and preventing redundant or spurious fusion artifacts. In generative modeling, geometric affine modulation allows for latent-driven, non-rigid transformations that extend the expressive capacity of style-based generators beyond global channel-wise adaptations, yielding robust improvements in image and video synthesis tasks dominated by geometric diversity.

These findings underscore the utility and generality of geometric affine modulation in high-diversity, multi-modal visual tasks (Li et al., 16 Nov 2025, Yang et al., 2023).

PDF Markdown Chat (Pro)

References (2)

Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion (2025)

Learning Modulated Transformation in GANs (2023)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Geometric Affine Modulation Module (GAM).