Self-Modulated Convolution (SMC)

Updated 6 January 2026
  • Self-Modulated Convolution (SMC) is a neural network operator that replaces fixed filters with data-dependent modulation, enabling adaptive feature transformations.
  • In computer vision, SMC is exemplified by Affine Self Convolution (ASC), which employs spatially-local affine maps to reweight features while maintaining translation equivariance.
  • For hyperspectral denoising, SMC uses spectral self-modulation modules to generate context-aware scaling and bias, driving improved noise reduction and feature clarity.

Self-Modulated Convolution (SMC) denotes a class of neural network operators in which convolutional processing is dynamically modulated by data-dependent parameters, so that filters or feature transformations are adaptively determined either by neighboring spatial pixels (as in vision models) or by spectral context (as in hyperspectral image restoration). Two prominent instantiations are Affine Self Convolution (ASC), a self-attention-influenced convolution for computer vision (Diaconu et al., 2019), and spectral self-modulation modules for hyperspectral denoising (Torun et al., 2023). These approaches share a core principle: classical fixed kernels or feature normalizations are replaced by operators whose modulation parameters are predicted from the data, giving the network blocks adaptivity at inference time.

1. Mathematical Formulations of Self-Modulated Convolution Operators

In ASC, the convolution filter at each image position $x$ is modulated through a spatially-local affine map:

$$A_x[f](y) = f(y) \cdot L_x[\psi](y) + L_x[\beta](y),$$

where $L_x[g](y) = g(y - x)$ denotes spatial translation, $\psi: \mathbb{Z}^2 \to \mathbb{R}$ is a convolution filter, and $\beta: \mathbb{Z}^2 \to \mathbb{R}$ is a bias. This map affinely reweights the local neighborhood for each pixel.
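
The map can be sketched in a few lines of PyTorch; the function name, shapes, and the use of `unfold` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def asc_affine_map(f, psi, beta):
    """Compute A_x[f](y) = f(y) * psi(y - x) + beta(y - x) for every centre x
    and every neighbour y in its k x k window (a sketch, not the paper's code).

    f    : (1, 1, H, W) feature map
    psi  : (k, k) multiplicative filter
    beta : (k, k) additive bias
    Returns a (H*W, k*k) tensor whose row x lists A_x[f](y) over the window.
    """
    k = psi.shape[-1]
    # unfold gathers, for each centre x, the window values f(y); the offset
    # within the window plays the role of y - x in the formula above.
    patches = F.unfold(f, kernel_size=k, padding=k // 2)          # (1, k*k, H*W)
    modulated = patches * psi.reshape(1, -1, 1) + beta.reshape(1, -1, 1)
    return modulated.squeeze(0).transpose(0, 1)                   # (H*W, k*k)

f = torch.randn(1, 1, 8, 8)
neighbourhoods = asc_affine_map(f, torch.randn(5, 5), torch.randn(5, 5))
```

Summing each row over $y$ would recover an ordinary correlation with $\psi$ plus a constant bias; ASC instead feeds these modulated neighborhoods into the attention step described in Section 2.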

For spectral self-modulation of deep features in hyperspectral image denoising, the mechanism combines instance normalization of a feature map $X \in \mathbb{R}^{h \times w \times C}$ with an adaptive affine transform based on $K$ adjacent spectral bands $Y^\lambda \in \mathbb{R}^{h \times w \times K}$:

$$\hat{X}_{i,j,c} = \frac{X_{i,j,c} - \mu_c}{\sigma_c}, \qquad X'_{i,j,c} = \gamma_{i,j,c}\,\hat{X}_{i,j,c} + \beta_{i,j,c},$$

where the scale $\gamma$ and bias $\beta$ are generated via small conv-nets from the spectral context, ensuring spatial alignment. This scheme adjusts not the kernels but the post-activation features, resulting in strong data-adaptive transformations.
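
A minimal PyTorch sketch of such a module follows; the class name, hidden width, and two-layer conv-net design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpectralSelfModulation(nn.Module):
    """Instance-normalise features X, then rescale and shift them with
    per-pixel gamma/beta predicted from K adjacent spectral bands (sketch)."""

    def __init__(self, channels, k_bands, hidden=32):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # Small conv-nets map the spectral context Y^lambda to spatially
        # aligned scale and bias maps with the same shape as X.
        self.to_gamma = nn.Sequential(
            nn.Conv2d(k_bands, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1))
        self.to_beta = nn.Sequential(
            nn.Conv2d(k_bands, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1))

    def forward(self, x, y_context):
        # x: (N, C, h, w) deep features; y_context: (N, K, h, w) adjacent bands
        x_hat = self.norm(x)
        return self.to_gamma(y_context) * x_hat + self.to_beta(y_context)
```

Because $\gamma$ and $\beta$ are full feature maps rather than per-channel scalars, the affine transform varies from pixel to pixel, which is what keeps the modulation spatially aligned with the noise pattern.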

2. Architectural Embodiments and Modulation Workflow

In ASC-equipped residual bottleneck blocks (e.g., ResNet-29 variants), the architecture applies 1×1 convolutions to produce $Q$, $K$, and $V$ embeddings. These are then processed via depth-wise convolutional affine maps for each head and spatial patch, yielding local self-attention weights

$$\alpha(x, y) = \mathrm{softmax}_y\!\left( \frac{A^Q_x[Q](x) \cdot A^K_x[K](y)}{\sqrt{d_k}} \right)$$

and output

$$\mathrm{output}_c(x) = \sum_y \alpha(x, y) \cdot A^V_x[V]_c(y).$$

Multi-headed aggregation and a final 1×1 convolution complete the block.
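
The attention step can be sketched as follows for a single head, assuming the affine maps have already been applied to `q`, `k`, and `v`; names and the `unfold`-based windowing are illustrative.

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window=5):
    """For each position x, softmax-weight the values in its window by the
    scaled dot product of q(x) with the window's keys (single-head sketch)."""
    n, c, h, w = q.shape
    pad = window // 2
    # Gather the window around every position: (n, c, window*window, h*w).
    k_win = F.unfold(k, window, padding=pad).reshape(n, c, window * window, h * w)
    v_win = F.unfold(v, window, padding=pad).reshape(n, c, window * window, h * w)
    q_flat = q.reshape(n, c, 1, h * w)
    scores = (q_flat * k_win).sum(dim=1) / c ** 0.5   # d_k = c in this sketch
    alpha = scores.softmax(dim=1)                     # softmax over window positions y
    out = (alpha.unsqueeze(1) * v_win).sum(dim=2)     # weighted sum of values
    return out.reshape(n, c, h, w)
```

A multi-head version would split the channel axis into $H$ groups, run this routine per group, and merge the results with the final 1×1 convolution.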

For hyperspectral image denoising, the Spectral Self-Modulating Residual Block (SSMRB) integrates spectral self-modulation modules (SSMMs) after each 3×3 convolution. The SSMM computes per-pixel scale and bias from spectral neighbors, applying affine modulation to instance-normalized features before residual addition.
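
Continuing the sketch from Section 1, a block with this structure might look as follows; the activation and exact layer ordering are assumptions based on the description above.

```python
import torch.nn as nn

class SSMRB(nn.Module):
    """Spectral self-modulating residual block: an SSMM after each 3x3
    convolution, then a residual addition. Reuses the SpectralSelfModulation
    class sketched in Section 1; an illustrative reconstruction."""

    def __init__(self, channels, k_bands):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.ssmm1 = SpectralSelfModulation(channels, k_bands)
        self.ssmm2 = SpectralSelfModulation(channels, k_bands)
        self.act = nn.ReLU()

    def forward(self, x, y_context):
        h = self.act(self.ssmm1(self.conv1(x), y_context))
        h = self.ssmm2(self.conv2(h), y_context)
        return x + h   # residual addition after modulation
```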

3. Equivariance, Invariance, and Group Extensions

ASC is proven translation equivariant, i.e., applying a spatial shift to the input results in an identical shift of the output:

$$\square_{\mathbb{Z}^2}[L_z[f], A](x) = L_z\big[\square_{\mathbb{Z}^2}[f, A]\big](x).$$

The proof follows from variable substitution and properties of the translation operator, as detailed in (Diaconu et al., 2019), Appendix Eqns. (14–16).
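
The substitution step can be written out for the simplified case in which $\square$ sums the affine map over the neighborhood; the full proof with attention weights follows the same pattern.

```latex
\begin{align*}
\square_{\mathbb{Z}^2}[L_z[f], A](x)
  &= \sum_{y} \big( f(y - z)\,\psi(y - x) + \beta(y - x) \big) \\
  &= \sum_{y'} \big( f(y')\,\psi(y' - (x - z)) + \beta(y' - (x - z)) \big)
     \qquad (y' = y - z) \\
  &= \square_{\mathbb{Z}^2}[f, A](x - z)
   = L_z\big[\square_{\mathbb{Z}^2}[f, A]\big](x).
\end{align*}
```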

The ASC framework supports group-equivariant and group-invariant extensions, notably Squeeze-and-Excitation (SE) modules generalized over the roto-translation group $p4 \cong \mathbb{Z}^2 \rtimes C_4$. Here, global pooling is replaced by averaging over the group, maintaining equivariance under $\mathbb{Z}^2$ shifts and 90° rotations.
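
The group-averaged "squeeze" step is simple to sketch; the $(N, C, 4, H, W)$ layout with an explicit rotation axis is an assumption about how p4 feature maps are stored.

```python
import torch

def p4_group_pool(x):
    """Average over the spatial grid AND the four 90-degree rotation channels,
    so the squeezed (N, C) descriptor is invariant to shifts and rotations."""
    return x.mean(dim=(2, 3, 4))

# The descriptor is unchanged when the input is acted on by a 90-degree
# rotation, which rotates the grid and cyclically permutes rotation channels.
x = torch.randn(2, 16, 4, 8, 8)
x_rot = torch.rot90(x.roll(1, dims=2), k=1, dims=(3, 4))
assert torch.allclose(p4_group_pool(x), p4_group_pool(x_rot), atol=1e-6)
```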

4. Practical Implementation and Training Protocols

For ASC, network training utilizes SGD with Nesterov momentum (momentum = 0.9, weight decay = 1e-4), He initialization, ZeroInit scaling after BatchNorm for stability, and multi-head attention (e.g., $H = 8$ heads, kernel size $K = 5$). Data augmentation includes spatial cropping and flips on CIFAR-10/100. The architecture replaces conventional 3×3 convolutions in ResNet bottlenecks with ASC layers.
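
In PyTorch, the reported optimizer settings correspond to the following; the stand-in model and the learning rate are assumptions, since the section above does not state them.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)     # stand-in for an ASC-equipped ResNet variant

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                     # assumption; not reported above
    momentum=0.9,               # as reported
    weight_decay=1e-4,          # as reported
    nesterov=True,              # Nesterov momentum, as reported
)
```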

For spectral self-modulation networks, patches of size 20×20 are sampled (stride = 10), with $K = 24$ adjacent bands as context, instance normalization, and scale/bias conv-nets per SSMM. The optimizer is Adam (no weight decay), with Xavier normal initialization and random rotation augmentation. At inference, continuity of spectral bands is enforced by flipping the first/last bands.
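
Edge handling of the spectral context can be sketched as reflected band indexing; the function and the exact reflection convention are assumptions consistent with the description above.

```python
import numpy as np

def spectral_context(cube, band, k=24):
    """Gather the K bands adjacent to `band` from an (H, W, B) cube, flipping
    the band order at the spectral edges so the context is always full size."""
    num_bands = cube.shape[-1]
    idx = np.arange(band - k // 2, band + k // 2)
    idx = np.abs(idx)                                               # reflect below band 0
    idx = np.where(idx >= num_bands, 2 * num_bands - idx - 2, idx)  # reflect above the last band
    return cube[..., idx]

cube = np.random.rand(20, 20, 191)
print(spectral_context(cube, 0).shape)   # (20, 20, 24), even at the first band
```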

5. Empirical Performance in Vision and Hyperspectral Denoising

On CIFAR-10 and CIFAR-100, ASC-augmented ResNet architectures reduce parameter counts while matching or exceeding the accuracy of standard convolutional and self-attention baselines. Key results include:

  • ResNet29+ASC (268k params): 91.24% ± 0.32 accuracy (CIFAR-10)
  • ResNet29+ASC (291k params): 68.68% accuracy (CIFAR-100)
  • p4-equivariant ResNet29+ASC (272k/283k params): 93.03% (CIFAR-10) and 72.71% (CIFAR-100), respectively

In hyperspectral denoising, SM-CNN (stacked SSMRBs) significantly improves metrics on synthetic and real noise cases:

  • WDC mixture noise Case 5: MPSNR = 29.83 dB (+1.6 dB over best prior), MSSIM = 0.973, SAM = 0.066
  • Cross-dataset (Pavia University, no retraining): MPSNR = 31.36 ± 0.12 dB; MSSIM = 0.923 ± 0.002; SAM = 0.124 ± 0.001
  • Real Indian Pines (denoising + SVM): Accuracy = 89.31%, kappa = 0.8781 (beats QRNN3D baselines)
  • Real HSIDwRD (trained on synthetic noise): MPSNR = 32.04 dB, MSSIM = 0.940, SAM = 0.139

This suggests that data-driven modulation of features enables enhanced adaptability to complex, nonstationary noise and group-structured symmetries, driving robust generalization across domains.

6. Relation to Dynamic Convolution

Self-Modulated Convolution differs from dynamic convolution, which generates entire kernels on the fly: SMC instead modulates either convolutional filters (ASC) or normalization parameters (SSMRB) as functions of local data context. In ASC, modulation occurs at the filter level, mediated by soft attention; in SM-CNN, modulation acts at the normalized-feature level, creating spatially and spectrally varying affine transforms.

A plausible implication is that self-modulation leverages rich context dependencies for robust feature transformation without resorting to parameter-inefficient dynamic kernel generation. This improves performance (accuracy and denoising metrics) while maintaining or reducing the overall parameter count, and supports generalization to unseen domains.

7. Future Directions and Open Questions

Continued research may explore extensions of self-modulation to additional group structures, higher-dimensional contexts, and alternative modalities. Investigation into trade-offs between modulation complexity, parameter efficiency, and empirical generalization remains warranted. Open questions include the optimal architectural integration of self-modulation for diverse application domains and its synergy with emerging attention and normalization strategies.

References

  • Diaconu, N., and Worrall, D. (2019). Affine Self Convolution.
  • Torun, O., et al. (2023). Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks.
