
Adaptive Conditional Feature Modulator

Updated 16 January 2026
  • ACFM is a class of neural network modules that conditionally and adaptively modulate feature representations using learned affine or convolutional transforms from contextual cues.
  • It enables continuous, context-aware adaptation across tasks like image restoration, video coding, and medical registration while maintaining near-baseline performance.
  • Empirical results show ACFM achieves performance gaps below 0.3 dB with minimal parameter overhead through efficient two-phase training.

Adaptive Conditional Feature Modulator (ACFM) designates a class of neural network modules that learn to conditionally and adaptively alter feature representations via learned affine or convolutional transformations derived from contextual information, task parameters, or input statistics. ACFM layers generalize normalization, conditional convolution, and modulation techniques, enabling neural networks to perform continuous, context-aware adaptation across restoration levels, contrast conditions, or coding regimes. Primarily deployed in image restoration, video compression, and medical image registration tasks, ACFM modules have demonstrated efficacy in bridging discrete training regimes and continuous test conditions, delivering strong generalization with minimal parameter overhead (He et al., 2019, Chen et al., 2022, Wang et al., 9 Jan 2026).

1. Principles and General Mechanisms

At their core, ACFM modules insert conditional affine or convolutional operations into feature-processing pipelines of deep networks. Adaptiveness is achieved by dynamically predicting modulation parameters—scaling, shifting, or small convolutional kernels—according to context vectors, such as restoration levels, content features, coding parameters, or image contrast statistics.

An ACFM typically modulates a feature map $F \in \mathbb{R}^{C \times H \times W}$ as
$$\hat{F}_c = \gamma_c(F, \text{cond}) \cdot F_c + \beta_c(F, \text{cond}),$$
where $\gamma_c$ and $\beta_c$ are channel-wise scale and shift computed as a function of $F$ and additional conditioning signals (Chen et al., 2022, Wang et al., 9 Jan 2026). Variants use depthwise convolutions in place of scaling, or apply normalization before modulation (He et al., 2019, Wang et al., 9 Jan 2026).
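As a concrete illustration, the following is a minimal NumPy sketch of this channel-wise affine modulation. The two-layer conditioning network and all shapes are illustrative assumptions, not taken from any of the cited papers:

```python
import numpy as np

def acfm_modulate(F, cond, W1, b1, W2, b2):
    """Channel-wise ACFM: F_hat_c = gamma_c * F_c + beta_c.

    F    : feature map, shape (C, H, W)
    cond : conditioning vector, shape (D,)
    A small MLP (W1, b1, W2, b2) maps cond -> 2C values:
    C scales (gamma) and C shifts (beta).
    """
    C = F.shape[0]
    h = np.tanh(cond @ W1 + b1)          # hidden layer
    params = h @ W2 + b2                 # shape (2C,)
    gamma, beta = params[:C], params[C:]
    # broadcast per-channel scale/shift over the spatial dims
    return gamma[:, None, None] * F + beta[:, None, None]

# toy usage: C=4 channels, 8x8 features, 3-dim conditioning vector
rng = np.random.default_rng(0)
C, H, W, D, Hdim = 4, 8, 8, 3, 16
F = rng.standard_normal((C, H, W))
cond = rng.standard_normal(D)
W1, b1 = rng.standard_normal((D, Hdim)), np.zeros(Hdim)
W2, b2 = rng.standard_normal((Hdim, 2 * C)), np.zeros(2 * C)
F_hat = acfm_modulate(F, cond, W1, b1, W2, b2)
print(F_hat.shape)  # (4, 8, 8)
```

In practice the conditioning MLP is trained jointly with (or on top of) the backbone; the point here is only the parameter flow from conditioning vector to per-channel scale and shift.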

Conditioning sources are chosen for maximal relevance to the intended adaptation:

  • Image restoration: interpolation coefficients encoding restoration level (He et al., 2019).
  • Video coding: content features, coding level, and rate-control parameter (Chen et al., 2022).
  • Medical registration: low-frequency bands of the input (contrast information) (Wang et al., 9 Jan 2026).

2. Mathematical Formulations

ACFM modules are implemented via either direct affine modulation or generalized depthwise convolution. Representative formulations include:

| Application Domain | Modulation Operation | Conditioning Signal |
|---|---|---|
| Image restoration (He et al., 2019) | $\mathrm{AdaFM}(x_i) = g_i * x_i + b_i$ | Interpolation between identity and learned kernel/bias |
| Video coding (Chen et al., 2022) | $\hat{F}_{\ell} = \gamma_{\ell} \odot F_{\ell} + \beta_{\ell}$ | Content $c_{\ell}$, coding level $C$, rate parameter $\lambda$ |
| Medical image registration (Wang et al., 9 Jan 2026) | $h'_i = \alpha_{\theta,i}(v_i)\,\dfrac{h_i - \mu(h_i)}{\sigma(h_i)} + \beta_{\theta,i}(v_i)$ | DWT-LL content vector $v_i$ |

In each case, γ\gamma, β\beta (or gig_i, bib_i) are produced by compact sub-networks (e.g., MLPs, linear projections, or table lookups) using the current conditioning vector.
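The normalization-before-modulation variant from the registration row can be sketched in a few lines of NumPy. The linear projection heads and all sizes are hypothetical stand-ins for the compact sub-networks described above:

```python
import numpy as np

def conditioned_instance_norm(h, v, Wa, ba, Wb, bb, eps=1e-5):
    """Normalize per channel, then modulate from a conditioning vector:
    h' = alpha(v) * (h - mu(h)) / sigma(h) + beta(v).

    h : feature map, shape (C, H, W); statistics taken per channel
    v : conditioning vector (e.g. a low-frequency content code), shape (D,)
    Wa/ba, Wb/bb : linear projection heads producing C-dim alpha and beta
    """
    mu = h.mean(axis=(1, 2), keepdims=True)
    sigma = h.std(axis=(1, 2), keepdims=True)
    alpha = (v @ Wa + ba)[:, None, None]   # per-channel scale from v
    beta = (v @ Wb + bb)[:, None, None]    # per-channel shift from v
    return alpha * (h - mu) / (sigma + eps) + beta

# toy usage: C=3 channels, 6x6 features, 4-dim content code
rng = np.random.default_rng(1)
C, H, W, D = 3, 6, 6, 4
h = rng.standard_normal((C, H, W))
v = rng.standard_normal(D)
Wa, ba = np.zeros((D, C)), np.ones(C)    # heads chosen so alpha = 1
Wb, bb = np.zeros((D, C)), np.zeros(C)   # and beta = 0
out = conditioned_instance_norm(h, v, Wa, ba, Wb, bb)
```

With the heads chosen as above the layer reduces to plain instance normalization; learned heads let the conditioning vector re-scale and re-shift each channel.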

For continuous-interpolation tasks (He et al., 2019), ACFM parameters are interpolated between the identity and a learned adapted state via a scalar $\alpha$:
$$g_i(\alpha) = I + \alpha(g_i - I), \quad b_i(\alpha) = \alpha b_i,$$
enabling a “slider” interface that controls restoration strength through a simple mapping $\alpha = f(L_c)$ for each required level $L_c$.
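A minimal sketch of this interpolation for per-channel ($1 \times 1$) AdaFM parameters, where the identity kernel is simply all-ones; the numeric values are illustrative:

```python
import numpy as np

def interpolate_adafm(g_learned, b_learned, alpha):
    """Slide ACFM parameters between the identity (alpha = 0) and the
    learned adapted state (alpha = 1):
        g(alpha) = I + alpha * (g - I),  b(alpha) = alpha * b.
    For a per-channel 1x1 kernel the identity is an all-ones vector."""
    identity = np.ones_like(g_learned)
    g = identity + alpha * (g_learned - identity)
    b = alpha * b_learned
    return g, b

g_adapt = np.array([2.0, 0.5])     # made-up learned per-channel scales
b_adapt = np.array([0.1, -0.3])    # made-up learned per-channel biases
g0, b0 = interpolate_adafm(g_adapt, b_adapt, 0.0)  # alpha=0: identity
g1, b1 = interpolate_adafm(g_adapt, b_adapt, 1.0)  # alpha=1: adapted state
```

Intermediate $\alpha$ values give a continuum of operating points between the two endpoints, which is exactly what the slider exposes to the user.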

3. Training Strategies and Integration

Typical ACFM deployment proceeds in two or more training phases:

  • Base training: Train the network backbone at a canonical operating point (e.g., minimum degradation or lowest compression level).
  • ACFM-layer adaptation: Insert ACFM modules, freeze the backbone, and learn only the ACFM parameters for a new operating point (e.g., maximum degradation, higher bit rate) (He et al., 2019).
  • Multi-parameter conditioning: Alternatively, as in video coding or registration, the model is trained end-to-end to map content and auxiliary signals to modulation parameters (content-GAP features, coding level, lambda, or low-frequency DWT bands) (Chen et al., 2022, Wang et al., 9 Jan 2026).
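The two-phase recipe can be illustrated on a deliberately tiny toy problem: a frozen "backbone" fit at one operating point, with only the ACFM scale and shift trained for a new one. Everything here (the linear backbone, targets, learning rate) is a made-up illustration of the freezing logic, not a real restoration setup:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 1))

# Phase 1: "backbone" trained at the canonical operating point (y = 2x).
w = 2.0                        # pretend this was learned; now frozen
backbone = lambda x: w * x

# Phase 2: new operating point y = 3x + 1; learn gamma and beta only,
# by gradient descent on 0.5 * MSE, with the backbone untouched.
y_new = 3.0 * x + 1.0
gamma, beta, lr = 1.0, 0.0, 0.1
for _ in range(200):
    feat = backbone(x)                  # frozen features
    pred = gamma * feat + beta          # ACFM modulation
    err = pred - y_new
    gamma -= lr * np.mean(err * feat)   # d(0.5*MSE)/d(gamma)
    beta -= lr * np.mean(err)           # d(0.5*MSE)/d(beta)

# gamma -> 1.5 and beta -> 1, so gamma * (2x) + beta reproduces 3x + 1
```

The adaptation converges because the new operating point is reachable from the frozen features by an affine transform alone, which is the working assumption behind inserting lightweight ACFM layers instead of retraining the backbone.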

In image restoration, AdaFM layers are positioned post-convolution and pre-activation within each residual block; for super-resolution, a $5 \times 5$ kernel is optimal, while denoising and deblocking use $1 \times 1$ (per-channel affine) (He et al., 2019). In video coding, ACFM is incorporated after every convolution in both motion and inter-frame codecs, using a two-layer MLP to process concatenated content and context embeddings (Chen et al., 2022). In medical image registration, ACFM uses instance normalization modulated by projection heads conditioned on DWT low-frequency content, placed after every encoder convolution at each U-Net scale (Wang et al., 9 Jan 2026).

Optimization is performed with standard losses for the main task (pixel L1, MSE, or LNCC for registration) plus auxiliary losses to encourage invariance (e.g., contrast-invariance loss for medical registration) (Wang et al., 9 Jan 2026).

4. Empirical Performance and Ablation Analyses

Quantitative assessment consistently shows that ACFM-enabled models come within 0.1–0.2 dB of models trained specifically for each target condition, while using a single model instance:

  • Image denoising on CBSD68: Baseline PSNR 26.49 dB, ACFM 26.35 dB ($\Delta$ 0.14 dB).
  • JPEG deblocking on LIVE1: Baseline 29.55 dB, ACFM 29.35 dB ($\Delta$ 0.20 dB).
  • Super-resolution (Set5): Baseline 32.13 dB, ACFM 32.00 dB ($\Delta$ 0.13 dB) (He et al., 2019).

Ablation studies support the following:

  • Depthwise kernel size: $5 \times 5$ for super-resolution, $1 \times 1$ sufficient for denoising/deblocking (He et al., 2019).
  • Modulation parameter predictors: Content- and context-aware projection heads (MLPs) deliver measurable gains over static or hand-coded modulation (Chen et al., 2022, Wang et al., 9 Jan 2026).
  • Low-frequency DWT conditioning: Using only LL band for ACFM in medical domain outperforms high-/all-band alternatives (Wang et al., 9 Jan 2026).

Combined ACFM and contrast-invariance latent regularization (CLR) improve generalization across unseen contrasts by 2–5% Dice in brain MRI and cardiac mapping; separation of contributions shows ACFM alone provides much of the gain, with CLR adding further improvements (Wang et al., 9 Jan 2026).

5. Application Domains and Functionality

Adaptive Conditional Feature Modulators are established in three primary domains:

  • Image restoration: ACFM (as AdaFM) allows smooth control of noise, JPEG artifact strength, or upscaling factor without retraining, using a “slider” interface ($\alpha$ parameter) for continuous user-directed modulation (He et al., 2019).
  • Learned video coding: ACFM integrates content and coding-level adaptability via feature-wise affine modulation after every convolution, enabling a single network to handle hierarchical B-frame, YUV 4:2:0 content, and continuous-rate adaptation (Chen et al., 2022).
  • Medical image registration: ACFM in the AC-CAR framework yields contrast-agnostic latent representations by normalizing and re-scaling features according to the low-frequency content, supporting robust generalization to arbitrary (even unobserved) imaging contrasts, and improves both registration accuracy and reliability (Wang et al., 9 Jan 2026).

A plausible implication is that the modular nature of ACFM layers, with lightweight parameter overhead (typically 0.2–4% for image restoration, minimal extra FLOPs in registration), makes them suitable for fast, interactive, or edge applications where re-deployment or retraining is impractical.

6. Implementation Notes and Practical Considerations

Empirical guidelines for ACFM design include:

  • Restrict the modulation “range” between adaptation endpoints to keep performance gaps below 0.3 dB; segment broader ranges as necessary (He et al., 2019).
  • Employ task-aligned filter sizes ($5 \times 5$ for SR, $1 \times 1$ for denoising/deblocking).
  • Condition on relevant, information-rich cues (content via GAP, context parameters, DWT-LL bands).
  • Use simple linear or low-degree polynomial fits for the mapping from control level to modulation coefficient $\alpha$ in continuous-control scenarios; a linear relationship generally suffices for moderate ranges (He et al., 2019).
  • In registration, employ instance normalization modulated by per-scale ACFM blocks; initialize with careful contrast augmentation and weighted multi-task losses (Wang et al., 9 Jan 2026).
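The recommended low-degree fit of the control mapping $\alpha = f(L_c)$ can be sketched as follows; the calibration pairs below are invented for illustration (in practice they would come from sweeping $\alpha$ and picking the best PSNR at each level):

```python
import numpy as np

# Hypothetical calibration: (restoration level, best alpha) pairs
levels = np.array([20.0, 30.0, 40.0, 50.0])   # e.g. noise sigma
alphas = np.array([0.0, 0.33, 0.68, 1.0])     # made-up measurements

# A degree-1 (linear) fit usually suffices for moderate ranges.
slope, intercept = np.polyfit(levels, alphas, deg=1)
alpha_for = lambda L: np.clip(slope * L + intercept, 0.0, 1.0)

print(float(alpha_for(35.0)))  # mid-range level -> mid-range alpha
```

Clipping keeps the slider inside the trained interpolation endpoints, consistent with the range restriction noted above.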

No dedicated regularizers for ACFM parameters are required; weight decay alone suffices during training. For medical tasks, performance should be monitored via feature-root-mean-square deviation (RMSD) and Dice scores to verify anatomical invariance.

7. Limitations and Prospective Directions

Current ACFM formulations rely on the quality of the conditioning vector and the range of training conditions. Excessive interpolation beyond two discrete endpoints can degrade performance; subdividing the modulation space mitigates this but increases adaptation complexity (He et al., 2019). ACFM generalization in dynamic-contrast or spatiotemporally heterogeneous contexts (e.g., video with temporally varying noise) warrants further investigation.

A plausible area for future research involves scalable and hierarchical ACFM architectures capable of robust adaptation across broader semantic and distributional shifts, leveraging recent advances in meta-learning, attention-based conditional modulation, or multi-modal context extraction. Integrating explicit uncertainty estimation, as in deformable registration, may further enhance reliability and interpretability in real-world deployments (Wang et al., 9 Jan 2026).
