Spatial Alignment Module (SAM)

Updated 24 December 2025

Spatial Alignment Module (SAM) is a learnable neural component that aligns spatial features across inputs to improve fusion and downstream performance.
It uses techniques like deformable convolutions and window-based cross-attention to address misalignment in tasks such as colonoscopy segmentation and HDR imaging.
Empirical results highlight SAM's effectiveness, enhancing segmentation accuracy and suppressing ghosting in multi-exposure HDR through adaptive feature registration.

A Spatial Alignment Module (SAM) is a learned neural network component that enforces spatial consistency among corresponding feature representations from related inputs, thereby facilitating improved aggregation and subsequent processing. SAMs have been introduced in different domains— notably, in colonoscopy video polyp segmentation (Hu et al., 19 Jun 2024) and multi-exposure HDR image reconstruction (Li et al., 30 Jun 2025)—to address the challenges posed by misalignment due to camera or scene dynamics. These modules differ in specific architecture but share the unified objective of aligning features across frames or exposures to enable robust feature interaction and downstream fusion.

1. Motivations for Spatial Alignment

The principal motivation for SAM arises from the limitations of naïve feature fusion in the presence of spatial misalignment:

Colonoscopy video segmentation: The endoscope’s rapid and often non-uniform movement induces substantial apparent displacements and geometric variation between adjacent frames, unlike typical natural scene videos. Classical optical-flow or global attention mechanisms fail: optical flow is non-discriminative because both foreground and background move identically with the camera; spatio-temporal attention is brittle when input features are grossly misaligned. SAM is thus employed to "warp" or register the feature map from time $t-1$ to that of time $t$ before any fusion, stabilizing the aggregation in the face of abrupt viewpoint changes (Hu et al., 19 Jun 2024).
HDR fusion from multi-exposure LDR images: For HDR reconstruction, spatial misalignment ("ghosting") across exposures degrades results. Here, the alignment problem is formalized via a deep-unfolding framework from maximum a posteriori (MAP) inference, and the SAM plays the role of a learned gradient-descent update operator that solves for optimal spatial correspondence between features before fusion (Li et al., 30 Jun 2025).

Both cases highlight the necessity of data-driven, adaptive alignment mechanisms embedded within the main learning process.

2. Core Architectures

SAM instantiations differ by context but maintain analogous principles: local alignment of features via learnable transformations, followed by harmonized interaction.

Colonoscopy Video Segmentation SAM (SALI)

The SAM in the SALI network operates at three spatial scales. Its pipeline for two consecutive frames $f_{t-1}, f_t \in \mathbb{R}^{C \times h \times w}$ consists of three stages:

Offset Prediction (N₀): Features are concatenated and passed through a $1 \times 1$ convolution to predict $2K$ offset channels—where $K=9$ reflects the 3×3 deformable kernel locations—yielding a set of pixel-wise offsets $\Delta p_k$ .
Deformable Convolution (DConv): These offsets guide a 3×3 deformable convolution applied to $f_{t-1}$ , producing a spatially aligned feature $f_{t-1}'$ .
Feature Harmonization: Aligned and current features are concatenated and processed via a self-attention block, with local keys/values computed via max-pooling and a $3\times3$ convolution. A dot-product attention fuses the contextualized features to yield $F_\text{short}$ for further decoding or downstream long-term interaction (Hu et al., 19 Jun 2024).

HDR Fusion SAM (AFUNet)

Within AFUNet, the SAM operates per iteration and exposure. It takes the prior aligned feature $f_{\alpha_i}^{t-1}$ and the current HDR feature estimate $f_x^{t-1}$ and operates as a window-based cross-attention transformer block:

Reference Embedding: $f_x^{t-1}$ is projected via a two-layer MLP to yield $r_i$ .
Key-Value Construction: $K\!V = f_{\alpha_i}^{t-1} + r_i$ .
Window Partitioning: Features are partitioned into non-overlapping $M \times M$ patches for localized attention.
Cross-Attention: Within each window, standard QKV projections yield $Q, K, V$ ; scaled dot-product attention is computed and reassembled.
Feedforward Processing: Output undergoes two-layer FFN, with LayerNorms and GELU activations throughout (Li et al., 30 Jun 2025).

The result is a spatially aligned feature map $f_{\alpha_i}^{t}$ , directly ready for adaptive fusion.

3. Mathematical Formulation

The mathematical core of SAM is its learnable alignment transformation:

SALI deformable convolution:

$y(p_0) = \sum_{k=1}^{K} w_k \cdot x\bigl(p_0 + p_k + \Delta p_k(p_0)\bigr)$

$x(\cdot)$ is the input ( $f_{t-1}$ ), $w_k$ are learned weights, $p_k$ are fixed sampling offsets, $\Delta p_k$ are learned, data-dependent offsets from $N_0$ . This mechanism "warps" features for correspondence before fusion (Hu et al., 19 Jun 2024).

AFUNet unfolding alignment:

Alignment is formalized as an iterative update: for exposure $i$ ,

$\alpha_i^t = \alpha_i^{t-1} - \varsigma_i \nabla_{\alpha_i} p_i(D_i x^{t-1}, \alpha_i^{t-1}),$

which, under deep unfolding, is learned by

$f_{\alpha_i}^t = \mathrm{FFN}\left(\text{WCAA}(f_{\alpha_i}^{t-1}, f_x^{t-1})\right)$

where WCAA denotes "window-based cross-attention alignment" (Li et al., 30 Jun 2025).

4. Offset Learning and Adaptivity

In both implementations, the alignment parameters (offsets in SALI, transformer weights in AFUNet) are trained end-to-end with the task objective (segmentation loss for SALI, image PSNR for AFUNet). No auxiliary supervision for alignment is used. Offsets in the deformable convolution module are L2 regularized to avoid pathological displacements (Hu et al., 19 Jun 2024). The transformer variant in AFUNet adapts via both window partitioning and learned projection, learning context-specific correspondence without direct geometric supervision (Li et al., 30 Jun 2025).

5. Position Within Network Pipelines

SAMs operate as integral components in multi-stage neural architectures:

SALI: The aligned short-term features produced by SAM are either directly used for mask decoding or are further processed by a Long-term Interaction Module (LIM) that aggregates representations across more distant temporal horizons (Hu et al., 19 Jun 2024).
AFUNet: SAM appears as the first submodule in each Alignment-Fusion Module (AFM) per iteration; its output informs the subsequent Spatial Fusion Module, and, through multi-stage alternation, guides the network towards increasingly harmonized HDR feature reconstructions (Li et al., 30 Jun 2025).

6. Implementation and Hyperparameterization

A selection of empirical implementation details:

Parameter	SALI (Hu et al., 19 Jun 2024)	AFUNet (Li et al., 30 Jun 2025)
Channels	$C=32$ at each scale	$C=64$
Window size	Pooling window $4 \times 4$ attention	$M=8$ (non-overlapping)
Attention type	Local dot-product with pooling	Multi-head within window
Backbone	PVTv2 + RFB blocks	Not specified
Training	Adam, lr= $10^{-4}$ , batch=14, 30 epochs	Deep unfolding, iterative

SAMs are designed for efficiency: lightweight convolutional or transformer layers, no extra auxiliary supervision, and scalable multi-scale or multi-stage deployment.

7. Empirical Impact and Validation

Empirical ablations underscore the effectiveness of SAM:

In SALI, adding SAM to an image-only baseline improves Dice by $+1.7\%$ on Seen-Hard and $+3.7\%$ on Unseen-Hard subsets of the SUN-SEG benchmark. Full SALI (SAM+LIM) achieves further improvement, and qualitative analysis demonstrates strongly stabilized segmentation despite abrupt camera movements (Hu et al., 19 Jun 2024).
In AFUNet, inclusion of SAM yields a $+0.54$ dB improvement in $\mathrm{PSNR}_\mu$ (43.94→44.48 dB) on Kalantari’s HDR dataset compared to a version without explicit alignment, indicating superior suppression of ghosting and more effective fusion of misaligned exposures (Li et al., 30 Jun 2025).

These results establish the strategic necessity and effectiveness of learned spatial alignment modules as foundational design choices for robust spatiotemporal inference and fusion in deep neural architectures.