CoordGate for Spatially-Aware CNNs
- CoordGate is a lightweight neural module enabling per-pixel spatial modulation for adapting convolutional filters to local imaging conditions.
- It employs a coordinate-encoding MLP with a multiplicative gating mechanism to achieve spatially-aware filtering, improving deblurring and related imaging tasks.
- Integration into CNNs, especially U-Net architectures, demonstrates enhanced performance metrics (PSNR, SSIM) with minimal computational overhead.
CoordGate is a lightweight neural network module designed to enable efficient computation of spatially-varying convolutions within Convolutional Neural Networks (CNNs), particularly for applications such as image deblurring. By leveraging a multiplicative gating mechanism coupled with a coordinate-encoding multilayer perceptron (MLP), CoordGate enables per-pixel modulation of convolutional filter responses based on spatial position. This approach provides a mechanism for spatially aware, locally varying filtering with CNNs, addressing the limitations of standard architectures under position-dependent degraded imaging conditions (Howard et al., 2024).
1. Problem Formulation: Spatially-Varying Convolutions
In optical and computational imaging, the system’s point-spread function (PSF) describes how a point source at coordinate $\mathbf{r}'$ is mapped onto the sensor. While ideal systems feature spatially invariant PSFs, real systems exhibit spatially-varying PSFs due to optical aberrations or defocus. For a “static, spatially-varying convolution,” the PSF at each pixel is fixed in advance and varies across the image coordinates, but not with the scene content.
The forward model for a blurred image $I_{\text{out}}$ produced from an input $I_{\text{in}}$ is:

$$I_{\text{out}}(\mathbf{r}) = \int I_{\text{in}}(\mathbf{r}')\, k(\mathbf{r}', \mathbf{r} - \mathbf{r}')\, d\mathbf{r}',$$

where $k(\mathbf{r}', \cdot)$ is the PSF kernel at $\mathbf{r}'$. In a vector-matrix formalism, this is represented as:

$$\vec{I}_{\text{out}} = A\, \vec{I}_{\text{in}},$$

where $A$ is a large sparse matrix whose rows correspond to PSFs varying smoothly with spatial index.
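As a concrete illustration, a static spatially-varying convolution can be sketched in NumPy, where each output pixel applies its own position-dependent kernel. This is a minimal sketch; the lens-like PSF below (sharpening into a blur toward the right edge) is a hypothetical example, not the paper's blur model:

```python
import numpy as np

def sv_convolve(img, kernel_fn, ksize=3):
    """Naive static spatially-varying convolution: kernel_fn(i, j) returns the
    ksize x ksize PSF for output pixel (i, j); kernels depend on position only,
    never on image content."""
    h, w = img.shape
    pad = ksize // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + ksize, j:j + ksize]
            out[i, j] = np.sum(patch * kernel_fn(i, j))
    return out

def lens_like_psf(i, j, w=8):
    """Hypothetical PSF: a delta kernel at the left edge blending smoothly
    into a box blur at the right edge."""
    delta = np.zeros((3, 3)); delta[1, 1] = 1.0
    box = np.full((3, 3), 1.0 / 9.0)
    t = j / max(w - 1, 1)          # 0 at left edge, 1 at right edge
    return (1 - t) * delta + t * box
```

Because every interpolated kernel sums to one, a constant image passes through unchanged; the per-pixel double loop is what the gated-convolution decomposition described later avoids.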
2. CoordGate Architecture: Gated Spatial Modulation
CoordGate addresses spatially-varying convolution by decomposing the process into two components: a convolutional block producing a basis of filters and a spatially steered gating mechanism. Specifically,
- Let $\mathbf{x} \in \mathbb{R}^{n_x \times n_y \times n_c^{\text{in}}}$ be the input feature map.
- $H \in \mathbb{R}^{n_x \times n_y \times n_c}$ is the output of a convolutional block (with $n_c$ channels).
- $C \in [0, 1]^{n_x \times n_y \times 2}$ is a fixed coordinate tensor of normalized $(x, y)$ values per pixel.
- $g: \mathbb{R}^2 \to \mathbb{R}^{n_c}$ is a small coordinate-encoding network (MLP) applied per pixel so that $G_{ij} = g(C_{ij})$.

The output is then gated multiplicatively:

$$\mathbf{x}' = H \odot G,$$

or equivalently, per pixel and channel,

$$x'_{ijc} = H_{ijc} \cdot g(C_{ij})_c.$$

The coordinate encoder $g$ is typically a shallow MLP with a few hidden layers (32 units per layer in the reported configuration). At pixel $(i, j)$ it computes:

$$g(C_{ij}) = W_L\, \sigma\big(\cdots \sigma(W_1 C_{ij} + b_1) \cdots\big) + b_L,$$

where $\sigma$ is a pointwise nonlinearity.
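The gating mechanism can be sketched minimally in NumPy. The choice of a single ReLU hidden layer and a sigmoid output squashing the gate into $(0, 1)$ is an assumption for illustration; the weights here are random rather than trained:

```python
import numpy as np

def coord_grid(nx, ny):
    """Normalized (x, y) coordinates in [0, 1] for every pixel: shape (nx, ny, 2)."""
    xs, ys = np.meshgrid(np.linspace(0, 1, nx), np.linspace(0, 1, ny), indexing="ij")
    return np.stack([xs, ys], axis=-1)

class CoordGate:
    """Per-pixel gate G = g(C) applied multiplicatively to a feature map H."""
    def __init__(self, n_c, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (2, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.5, (hidden, n_c)); self.b2 = np.zeros(n_c)

    def __call__(self, H):
        C = coord_grid(H.shape[0], H.shape[1])            # (nx, ny, 2)
        h = np.maximum(C @ self.W1 + self.b1, 0.0)        # shared MLP, ReLU hidden
        G = 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))  # sigmoid gate in (0, 1)
        return H * G                                      # elementwise modulation
```

Note that the same tiny MLP is evaluated at every pixel (only the coordinate input changes), which is what keeps the parameter count nearly independent of image size.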
3. Integration with CNNs and U-Net
CoordGate modules are typically integrated into CNNs after blocks of convolutions, with a particular focus on U-Net architectures. In standard practice:
- After each pair of 3×3 convolutions (each followed by ReLU), the CoordGate module is applied.
- In U-Net encoders, the sequence is: two 3×3 convolutions → CoordGate → max-pooling.
- In U-Net decoders, upsampling replaces pooling.
A procedural outline for a U-Net block with CoordGate:
```
function UBlock(Input X, CoordMap C):
    H  = Conv3x3_ReLU(Conv3x3_ReLU(X))   # [n_x, n_y, n_c^ℓ]
    G  = CoordEncoder(C)                 # [n_x, n_y, n_c^ℓ]
    X' = H * G                           # elementwise multiplication
    return X'
```
After the last CoordGate block, a 1×1 convolution can optionally be used to restore the desired channel count for residual learning.
4. Training Procedures and Computational Considerations
- Loss function: Mean-squared error (MSE) between output and ground-truth image; peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are also reported.
- Data: Approximately 20,000 real microscopy images, synthetically blurred with a spatially-varying, lens-like PSF.
- Optimization: Adam optimizer; the learning rate is halved when the validation loss plateaus over 20 epochs, for a total of 600 epochs. Standard weight decay is employed.
- Computational overhead:
- U-Net(3) baseline: 0.5M params
- CG U-Net(3): 0.6M params (three gates per step; MLP with 32 units per hidden layer)
- U-Net(6): 18M params
- CG U-Net(6): 18.2M params
- Inference time (per image, single GPU): U-Net(6): 34ms; CG U-Net(3): 12ms; CG U-Net(6): 38ms
The multiplicative gating via pixel-wise MLP introduces only minor overhead; small CoordGate-augmented networks match or outperform much larger conventional networks.
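The plateau-based schedule described above can be sketched in plain Python. The exact patience and reset semantics are assumptions (frameworks such as PyTorch's `ReduceLROnPlateau` implement the same idea with more options):

```python
class HalveOnPlateau:
    """Halve the learning rate when validation loss fails to improve
    for `patience` consecutive epochs (a sketch of the schedule above)."""
    def __init__(self, lr, patience=20):
        self.lr, self.patience = lr, patience
        self.best, self.stale = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:                 # improvement: reset the counter
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:      # plateau reached: halve and reset
                self.lr *= 0.5
                self.stale = 0
        return self.lr
```

Each epoch, the training loop would call `sched.step(val_loss)` and pass the returned rate to the optimizer.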
5. Experimental Results and Ablation Studies
A comparative summary of key models:
| Model | #Params | PSNR (dB) | SSIM |
|---|---|---|---|
| U-Net(3) | 0.5 M | 26.4 | 0.81 |
| U-Net(5) | 4.2 M | 28.7 | 0.87 |
| U-Net(6) | 18 M | 29.1 | 0.88 |
| CoordConv-UNet | 19 M | 29.2 | 0.89 |
| MultiWienerNet | 20 M | 29.5 | 0.90 |
| CG U-Net(3) | 0.6 M | 29.8 | 0.91 |
| CG U-Net(6) | 18.2 M | 31.2 | 0.93 |
Key findings from ablation studies:
- Removing the coordinate encoder (replacing its coordinate input with random noise) prevents the model from learning spatially-varying solutions.
- Omitting the multiplicative gate and merely concatenating coordinates (CoordConv) returns performance to that of the standard U-Net.
- Increasing the depth of the coordinate-encoding MLP beyond a few hidden layers yields diminishing returns, indicating that a lightweight architecture suffices.
6. Mechanistic Insights and Applicability Beyond Deblurring
The multiplicative gating mechanism, together with coordinate encoding, enables a mapping from spatial coordinates directly to convolutional filter weights, providing precise alignment with the locally-varying PSF. Multiplicative modulation preserves the parameter sharing of standard CNNs, enhancing representational efficiency for spatially heterogeneous tasks. In contrast, approaches such as CoordConv require the network to relearn positional semantics within the convolution weights, and padding-based positional encoding propagates spatial information less efficiently.
CoordGate generalizes to tasks where local filtering must adapt to spatially-dependent conditions:
- Super-resolution: When the PSF of the downsampling kernel varies spatially (e.g., due to lens fall-off in mobile cameras), CoordGate can modulate the upsampler per pixel.
- Denoising: In cases with spatially varying sensor noise covariance (e.g., sCMOS microscopes), filters can be adaptively reweighted based on noise statistics.
- Other spatially-varying transformations: CoordGate is applicable to spatially-varying tone mapping, vignetting correction, or any task where optimal filter response is a coordinate-dependent function.
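The vignetting-correction case, for instance, reduces to a purely coordinate-dependent multiplicative gain. The radial fall-off model below is illustrative and hypothetical, not taken from the source:

```python
import numpy as np

def radial_gain(nx, ny, strength=0.5):
    """Illustrative coordinate-to-gain map: boosts pixels far from the image
    center to undo lens fall-off (hypothetical quadratic vignetting model)."""
    xs, ys = np.meshgrid(np.linspace(-1, 1, nx), np.linspace(-1, 1, ny),
                         indexing="ij")
    r2 = xs**2 + ys**2
    return 1.0 + strength * r2      # gain grows with squared radius

# Multiplying a vignetted image by the gain flattens the field:
vignetted = np.ones((16, 16)) / radial_gain(16, 16)   # simulated fall-off
corrected = vignetted * radial_gain(16, 16)
```

In CoordGate terms, such a gain map is exactly what the coordinate MLP could learn as its gating output, with no hand-designed fall-off model required.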
CoordGate thus achieves an end-to-end differentiable, spatially-aware filtering regime, matching the expressivity of locally connected layers while retaining the efficiency and parameter sharing advantages of CNNs (Howard et al., 2024).