CoordGate for Spatially-Aware CNNs
- CoordGate is a lightweight neural module enabling per-pixel spatial modulation for adapting convolutional filters to local imaging conditions.
- It employs a coordinate-encoding MLP with a multiplicative gating mechanism to achieve spatially-aware filtering, improving deblurring and related imaging tasks.
- Integration into CNNs, especially U-Net architectures, demonstrates enhanced performance metrics (PSNR, SSIM) with minimal computational overhead.
CoordGate is a lightweight neural network module designed to enable efficient computation of spatially-varying convolutions within Convolutional Neural Networks (CNNs), particularly for applications such as image deblurring. By leveraging a multiplicative gating mechanism coupled with a coordinate-encoding multilayer perceptron (MLP), CoordGate enables per-pixel modulation of convolutional filter responses based on spatial position. This approach provides a mechanism for spatially aware, locally varying filtering with CNNs, addressing the limitations of standard architectures under position-dependent degraded imaging conditions (Howard et al., 2024).
1. Problem Formulation: Spatially-Varying Convolutions
In optical and computational imaging, the system’s point-spread function (PSF) describes how a point source at coordinate $\mathbf{r}'$ is mapped onto the sensor. While ideal systems feature spatially invariant PSFs, real systems exhibit spatially-varying PSFs due to optical aberrations or defocus. For a “static, spatially-varying convolution,” the PSF at each pixel is fixed in advance and varies across the image coordinates, but not with the scene content.
The forward model for a blurred image $I_{\text{out}}$ produced from an input $I_{\text{in}}$ is:

$$I_{\text{out}}(\mathbf{r}) = \int I_{\text{in}}(\mathbf{r}')\, k(\mathbf{r}', \mathbf{r} - \mathbf{r}')\, d\mathbf{r}',$$

where $k(\mathbf{r}', \cdot)$ is the PSF kernel at $\mathbf{r}'$. In a vector-matrix formalism, this is represented as:

$$\vec{I}_{\text{out}} = A\, \vec{I}_{\text{in}},$$

where $A$ is a large sparse matrix whose rows correspond to PSFs varying smoothly with spatial index.
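As a concrete illustration, a static spatially-varying convolution can be sketched in NumPy, where each output pixel applies its own position-dependent kernel. This is a minimal sketch; the lens-like PSF below (sharpening into a blur toward the right edge) is a hypothetical example, not the paper's blur model:

```python
import numpy as np

def sv_convolve(img, kernel_fn, ksize=3):
    """Naive static spatially-varying convolution: kernel_fn(i, j) returns the
    ksize x ksize PSF for output pixel (i, j); kernels depend on position only,
    never on image content."""
    h, w = img.shape
    pad = ksize // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + ksize, j:j + ksize]
            out[i, j] = np.sum(patch * kernel_fn(i, j))
    return out

def lens_like_psf(i, j, w=8):
    """Hypothetical PSF: a delta kernel at the left edge blending smoothly
    into a box blur at the right edge."""
    delta = np.zeros((3, 3)); delta[1, 1] = 1.0
    box = np.full((3, 3), 1.0 / 9.0)
    t = j / max(w - 1, 1)          # 0 at left edge, 1 at right edge
    return (1 - t) * delta + t * box
```

Because every interpolated kernel sums to one, a constant image passes through unchanged; the per-pixel double loop is what the gated-convolution decomposition described later avoids.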
2. CoordGate Architecture: Gated Spatial Modulation
CoordGate addresses spatially-varying convolution by decomposing the process into two components: a convolutional block producing a basis of filters and a spatially steered gating mechanism. Specifically,
- Let $\mathbf{x} \in \mathbb{R}^{n_x \times n_y \times n_c^{\text{in}}}$ be the input feature map.
- $H \in \mathbb{R}^{n_x \times n_y \times n_c}$ is the output of a convolutional block (with $n_c$ channels).
- $C \in [0, 1]^{n_x \times n_y \times 2}$ is a fixed coordinate tensor of normalized $(x, y)$ values per pixel.
- $g: \mathbb{R}^2 \to \mathbb{R}^{n_c}$ is a small coordinate-encoding network (MLP) applied per pixel so that $G_{ij} = g(C_{ij})$.

The output is then gated multiplicatively:

$$\mathbf{x}' = H \odot G,$$

or equivalently, per pixel and channel,

$$x'_{ijc} = H_{ijc} \cdot g(C_{ij})_c.$$

The coordinate encoder $g$ is typically a shallow MLP with a few hidden layers (32 units per layer in the reported configuration). At pixel $(i, j)$ it computes:

$$g(C_{ij}) = W_L\, \sigma\big(\cdots \sigma(W_1 C_{ij} + b_1) \cdots\big) + b_L,$$

where $\sigma$ is a pointwise nonlinearity.
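The gating mechanism can be sketched minimally in NumPy. The choice of a single ReLU hidden layer and a sigmoid output squashing the gate into $(0, 1)$ is an assumption for illustration; the weights here are random rather than trained:

```python
import numpy as np

def coord_grid(nx, ny):
    """Normalized (x, y) coordinates in [0, 1] for every pixel: shape (nx, ny, 2)."""
    xs, ys = np.meshgrid(np.linspace(0, 1, nx), np.linspace(0, 1, ny), indexing="ij")
    return np.stack([xs, ys], axis=-1)

class CoordGate:
    """Per-pixel gate G = g(C) applied multiplicatively to a feature map H."""
    def __init__(self, n_c, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (2, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.5, (hidden, n_c)); self.b2 = np.zeros(n_c)

    def __call__(self, H):
        C = coord_grid(H.shape[0], H.shape[1])            # (nx, ny, 2)
        h = np.maximum(C @ self.W1 + self.b1, 0.0)        # shared MLP, ReLU hidden
        G = 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))  # sigmoid gate in (0, 1)
        return H * G                                      # elementwise modulation
```

Note that the same tiny MLP is evaluated at every pixel (only the coordinate input changes), which is what keeps the parameter count nearly independent of image size.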
3. Integration with CNNs and U-Net
CoordGate modules are typically integrated into CNNs after blocks of convolutions, with a particular focus on U-Net architectures. In standard practice:
- After each pair of 3×3 convolutions (each followed by ReLU), the CoordGate module is applied.
- In U-Net encoders, the sequence is: two 3×3 convolutions → CoordGate → max-pooling.
- In U-Net decoders, upsampling replaces pooling.
A procedural outline for a U-Net block with CoordGate:
```
function UBlock(Input X, CoordMap C):
    H  = Conv3x3_ReLU(Conv3x3_ReLU(X))   # [n_x, n_y, n_c^ℓ]
    G  = CoordEncoder(C)                 # [n_x, n_y, n_c^ℓ]
    X' = H * G                           # elementwise multiplication
    return X'
```
After the last CoordGate block, a 1×1 convolution can optionally be used to restore the desired channel count for residual learning.
4. Training Procedures and Computational Considerations
- Loss function: Mean-squared error (MSE) between output and ground-truth image; peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are also reported.
- Data: Approximately 20,000 real microscopy images, synthetically blurred with a spatially-varying, lens-like PSF.
- Optimization: Adam optimizer; the learning rate is halved when the validation loss plateaus over 20 epochs, for a total of 600 epochs. Standard weight decay is employed.
- Computational overhead:
- U-Net(3) baseline: 0.5M params
- CG U-Net(3): 0.6M params (three gates per step; MLP with 32 units per hidden layer)
- U-Net(6): 18M params
- CG U-Net(6): 18.2M params
- Inference time (per image, single GPU): U-Net(6): 34ms; CG U-Net(3): 12ms; CG U-Net(6): 38ms
The multiplicative gating via pixel-wise MLP introduces only minor overhead; small CoordGate-augmented networks match or outperform much larger conventional networks.
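The plateau-based schedule described above can be sketched in plain Python. The exact patience and reset semantics are assumptions (frameworks such as PyTorch's `ReduceLROnPlateau` implement the same idea with more options):

```python
class HalveOnPlateau:
    """Halve the learning rate when validation loss fails to improve
    for `patience` consecutive epochs (a sketch of the schedule above)."""
    def __init__(self, lr, patience=20):
        self.lr, self.patience = lr, patience
        self.best, self.stale = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:                 # improvement: reset the counter
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:      # plateau reached: halve and reset
                self.lr *= 0.5
                self.stale = 0
        return self.lr
```

Each epoch, the training loop would call `sched.step(val_loss)` and pass the returned rate to the optimizer.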
5. Experimental Results and Ablation Studies
A comparative summary of key models:
| Model | #Params | PSNR (dB) | SSIM |
|---|---|---|---|
| U-Net(3) | 0.5 M | 26.4 | 0.81 |
| U-Net(5) | 4.2 M | 28.7 | 0.87 |
| U-Net(6) | 18 M | 29.1 | 0.88 |
| CoordConv-UNet | 19 M | 29.2 | 0.89 |
| MultiWienerNet | 20 M | 29.5 | 0.90 |
| CG U-Net(3) | 0.6 M | 29.8 | 0.91 |
| CG U-Net(6) | 18.2 M | 31.2 | 0.93 |
Key findings from ablation studies:
- Removing the coordinate encoder (replacing its coordinate input with random noise) prevents the model from learning spatially-varying solutions.
- Omitting the multiplicative gate and merely concatenating coordinates (CoordConv) returns performance to that of the standard U-Net.
- Increasing the depth of the coordinate-encoding MLP beyond a few hidden layers yields diminishing returns, indicating that a lightweight architecture suffices.
6. Mechanistic Insights and Applicability Beyond Deblurring
The multiplicative gating mechanism, together with coordinate encoding, enables a mapping from spatial coordinates directly to convolutional filter weights, providing precise alignment with the locally-varying PSF. Multiplicative modulation preserves the parameter sharing of standard CNNs, enhancing representational efficiency for spatially heterogeneous tasks. In contrast, approaches such as CoordConv require the network to relearn positional semantics within the convolution weights, and padding-based positional encoding propagates spatial information less efficiently.
CoordGate generalizes to tasks where local filtering must adapt to spatially-dependent conditions:
- Super-resolution: When the PSF of the downsampling kernel varies spatially (e.g., due to lens fall-off in mobile cameras), CoordGate can modulate the upsampler per pixel.
- Denoising: In cases with spatially varying sensor noise covariance (e.g., sCMOS microscopes), filters can be adaptively reweighted based on noise statistics.
- Other spatially-varying transformations: CoordGate is applicable to spatially-varying tone mapping, vignetting correction, or any task where optimal filter response is a coordinate-dependent function.
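The vignetting-correction case, for instance, reduces to a purely coordinate-dependent multiplicative gain. The radial fall-off model below is illustrative and hypothetical, not taken from the source:

```python
import numpy as np

def radial_gain(nx, ny, strength=0.5):
    """Illustrative coordinate-to-gain map: boosts pixels far from the image
    center to undo lens fall-off (hypothetical quadratic vignetting model)."""
    xs, ys = np.meshgrid(np.linspace(-1, 1, nx), np.linspace(-1, 1, ny),
                         indexing="ij")
    r2 = xs**2 + ys**2
    return 1.0 + strength * r2      # gain grows with squared radius

# Multiplying a vignetted image by the gain flattens the field:
vignetted = np.ones((16, 16)) / radial_gain(16, 16)   # simulated fall-off
corrected = vignetted * radial_gain(16, 16)
```

In CoordGate terms, such a gain map is exactly what the coordinate MLP could learn as its gating output, with no hand-designed fall-off model required.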
CoordGate thus achieves an end-to-end differentiable, spatially-aware filtering regime, matching the expressivity of locally connected layers while retaining the efficiency and parameter sharing advantages of CNNs (Howard et al., 2024).