
Modality-Agnostic Deformable Attention

Updated 16 November 2025
  • Modality-Agnostic Deformable Attention is a parameter-free, locality-constrained method that retrieves continuous pixel-level displacements for robust image registration.
  • It utilizes multi-resolution CNN feature extraction with a coarse-to-fine strategy to optimize both intra- and inter-modal alignments efficiently.
  • Experimental evaluations show superior performance in Dice scores and target registration error compared to conventional methods on diverse imaging datasets.

Modality-Agnostic Deformable Attention is a parameter-free attention mechanism for pixel-level correspondence retrieval in deformable image registration, independent of the input modality. This technique, embodied in the Vector Field Attention (VFA) framework, utilizes multi-resolution convolutional neural networks for feature extraction and a locality-constrained attention module that directly produces continuous displacement fields via attention-weighted aggregation, without requiring learnable parameters in the matching stage. VFA is end-to-end differentiable and compatible with both intra- and inter-modality registration problems, excelling in accuracy and computational efficiency relative to baseline methods.

1. Theoretical Foundation and Framework Architecture

The VFA framework adopts a three-stage decomposition for deformable image registration:

  1. Feature Extraction: For input fixed ($I_f$) and moving ($I_m$) images defined on $\Omega \subset \mathbb{R}^d$, two parallel multi-resolution U-Net-style CNNs extract feature maps at $L$ scales ($l=1$ for finest, $l=L$ for coarsest), yielding $F^l_f \in \mathbb{R}^{H_l \times W_l \times (D_l) \times C_l}$ and $F^l_m$ of matching dimensions. U-Net weights are shared for intra-modal registration and independent for inter-modal.
  2. Feature Matching (Attention): For each discrete voxel $\mathbf{x} \in \mathbb{Z}^d$ at scale $l$, a local attention window $\mathcal{W}(\mathbf{x}) = \{\mathbf{x} + \bm{\delta} \mid \bm{\delta} \in \{-1,0,1\}^d\}$ is established.
  3. Location Retrieval (Vector-Field Assembly): Displacement vectors are retrieved as attention-weighted summations of fixed offsets, forming a dense continuous vector field.

A coarse-to-fine strategy is used: the displacement $\phi^{l+1}$ estimated at a coarser scale is upsampled, and the moving image feature maps are warped accordingly before each finer scale is processed.
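As a rough illustration, the feature-extraction stage might look like the following PyTorch sketch; `unet_factory` and the multi-resolution encoder it constructs are hypothetical placeholders rather than the exact VFA architecture, with weight sharing toggled to match the intra- versus inter-modal setting described above.

import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Dual-encoder front end (sketch): weights are shared for intra-modal
    registration and kept separate for inter-modal registration.
    `unet_factory` builds a placeholder multi-resolution encoder that returns
    feature maps ordered from finest (l = 1) to coarsest (l = L)."""

    def __init__(self, unet_factory, shared=True):
        super().__init__()
        self.cnn_f = unet_factory()
        # Reusing the same module realizes weight sharing for intra-modal input.
        self.cnn_m = self.cnn_f if shared else unet_factory()

    def forward(self, fixed, moving):
        feats_f = self.cnn_f(fixed)    # [F_f^1, ..., F_f^L]
        feats_m = self.cnn_m(moving)   # [F_m^1, ..., F_m^L]
        return feats_f, feats_m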

2. Attention Module and Mechanism

The modality-agnostic attention module is parameter-free and consists of the following steps:

  • Query and Key Construction:

At voxel $\mathbf{x}$ in $F^l_f$, the query vector is $Q(\mathbf{x}) = F^l_f(\mathbf{x})$; keys $K(\mathbf{x}+\bm{\delta}) = F^l_m(\mathbf{x}+\bm{\delta})$ are collected within $\mathcal{W}(\mathbf{x})$.

  • Value Definition:

Value matrix $R$ provides fixed displacement vectors $R(\bm{\delta}) = -\bm{\delta}$, indexed over all neighbors.

  • Scaled Dot-Product Attention:

$$S(\mathbf{x}, \bm{\delta}) = \frac{Q(\mathbf{x})\, K(\mathbf{x}+\bm{\delta})^{\top}}{\sqrt{C_l}}$$

These scores are normalized with softmax:

$$A(\mathbf{x},\bm{\delta}) = \frac{\exp(S(\mathbf{x},\bm{\delta}))}{\sum_{\bm{\delta}'}\exp(S(\mathbf{x},\bm{\delta}'))}$$

  • Displacement Calculation:

The attention-weighted displacement for position $\mathbf{x}$ is:

$$\mathbf{u}^l(\mathbf{x}) = \sum_{\bm{\delta}\in\{-1,0,1\}^d} A(\mathbf{x},\bm{\delta})\, R(\bm{\delta})$$

yielding continuous, sub-voxel estimates.

Key properties:

  • No learnable parameters in the attention; only the feature extractor (CNN weights) and the scale $\beta$ are learned (a sketch of the matching step follows this list).
  • Cosine similarity may replace dot product as similarity kernel, yielding visually more coherent cross-modal correspondence, though with higher GPU memory consumption and no significant change in Dice performance.
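A minimal 2-D PyTorch sketch of this parameter-free matching step is given below (the published VFA implementation may differ; the 3-D case extends the 3×3 window to 3×3×3):

import torch
import torch.nn.functional as F

def local_attention_displacement(feat_f, feat_m, beta=1.0):
    """Parameter-free local attention (2-D sketch): match each fixed-image
    feature against the 3x3 neighborhood of the (warped) moving-image features
    and return an attention-weighted, sub-pixel displacement.
    feat_f, feat_m: (B, C, H, W); returns displacements of shape (B, 2, H, W)."""
    B, C, H, W = feat_f.shape
    # Keys: the 3x3 neighborhood of every location, as (B, C, 9, H*W).
    keys = F.unfold(feat_m, kernel_size=3, padding=1).view(B, C, 9, H * W)
    query = feat_f.view(B, C, 1, H * W)
    # Scaled dot-product scores over the 9 offsets, then softmax.
    scores = (query * keys).sum(dim=1) / C ** 0.5          # (B, 9, H*W)
    attn = torch.softmax(scores, dim=1)
    # Values: fixed offsets R(delta). The article states R(delta) = -delta;
    # flip `sign` if your warping convention expects +delta instead.
    sign = -1.0
    dy, dx = torch.meshgrid(torch.arange(-1, 2), torch.arange(-1, 2), indexing="ij")
    offsets = sign * torch.stack([dx, dy]).float().reshape(2, 9).to(feat_f)
    disp = torch.einsum("bkn,dk->bdn", attn, offsets)       # (B, 2, H*W)
    return beta * disp.view(B, 2, H, W)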

3. Continuous Deformation and Regularization

The resultant displacement field is composed across scales and rendered as a continuous deformation:

  • Multi-scale Composition:

At each scale $l$, the absolute map is:

$$\phi^l(\mathbf{x}) = \mathbf{x} + \beta\, \mathbf{u}^l(\mathbf{x})$$

Composed with the upsampled $\phi^{l+1}$ for hierarchical refinement.

  • Smoothness Regularization:

Diffusion penalty encourages spatial smoothness:

$$\mathcal{L}_{\rm Reg} = \sum_{\mathbf{x}}\sum_{i=1}^d \|\nabla_i \mathbf{u}^1(\mathbf{x})\|^2_2$$

  • Warped Image Rendering:

The final deformation $\phi^1$ governs image warping via a differentiable grid sampler (see the sketch after this list):

$$I_w(\mathbf{x}) = I_m(\phi^1(\mathbf{x}))$$
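Assuming 2-D inputs and an absolute coordinate map `phi` given in pixel units, the diffusion penalty and the grid-sampler warping can be sketched as follows; the coarse-to-fine composition can reuse the same grid-sampling primitive.

import torch
import torch.nn.functional as F

def diffusion_penalty(u):
    """Diffusion regularizer on a displacement field u of shape (B, 2, H, W):
    sum of squared finite differences along each spatial axis (sketch of L_Reg)."""
    dx = u[..., :, 1:] - u[..., :, :-1]
    dy = u[..., 1:, :] - u[..., :-1, :]
    return (dx ** 2).sum() + (dy ** 2).sum()

def warp(image, phi):
    """Warp `image` (B, C, H, W) with the absolute coordinate map `phi`
    (B, 2, H, W, pixel units) using the differentiable grid sampler.
    Coordinates are rescaled to [-1, 1] as grid_sample expects."""
    B, _, H, W = phi.shape
    gx = 2.0 * phi[:, 0] / (W - 1) - 1.0
    gy = 2.0 * phi[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)        # (B, H, W, 2), (x, y) order
    return F.grid_sample(image, grid, align_corners=True)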

4. Training Methodology and Modality-Agnostic Loss Functions

VFA supports unsupervised, semi-supervised, and weakly supervised training regimes:

  • Unsupervised Loss Function:

$$\mathcal{L} = \mathcal{L}_{\rm Sim}(I_f, I_m \circ \phi^1) + \lambda\, \mathcal{L}_{\rm Reg}(\mathbf{u}^1)$$

  • $\mathcal{L}_{\rm Sim}$: normalized cross-correlation (NCC) for intra-modal registration; mutual information (MI) for inter-modal registration; mean-squared error (MSE) for lung CT.
  • (Weakly) Supervised Loss Terms:

Provided anatomical labels ($L^c_f$, $L^c_m$) or landmarks ($\ell^j_f$, $\ell^j_m$), the losses include:

$$\mathcal{L}_{\rm Dice} = -\sum_c \mathrm{Dice}(L^c_f, L^c_m \circ \phi^1)$$

$$\mathcal{L}_{\rm TRE} = \frac{1}{N}\sum_{j=1}^N \|\phi^1(\ell^j_m) - \ell^j_f\|_2^2$$

The total loss integrates these with appropriate weights.
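A hedged sketch of how these terms might be assembled is shown below; MSE stands in for the similarity term (NCC or MI would be substituted depending on modality), `diffusion_penalty` is the helper sketched in the previous section, and all weights are illustrative rather than the published settings.

import torch

def total_loss(I_f, I_w, u1, seg_f=None, seg_w=None, lm_f=None, lm_w=None,
               lam=1.0, w_dice=1.0, w_tre=1.0):
    """Illustrative training objective. seg_* are one-hot label maps (seg_w
    already warped by phi^1); lm_* are landmark coordinates (lm_w already
    mapped through phi^1)."""
    # Unsupervised part: similarity (MSE placeholder) + diffusion regularizer.
    loss = torch.mean((I_f - I_w) ** 2) + lam * diffusion_penalty(u1)
    if seg_f is not None:
        # Weak supervision: negative soft Dice over label channels.
        inter = (seg_f * seg_w).sum(dim=(2, 3))
        denom = (seg_f.sum(dim=(2, 3)) + seg_w.sum(dim=(2, 3))).clamp(min=1e-6)
        loss = loss - w_dice * (2.0 * inter / denom).mean()
    if lm_f is not None:
        # Semi-supervision: mean squared landmark distance (TRE term).
        loss = loss + w_tre * ((lm_w - lm_f) ** 2).sum(dim=-1).mean()
    return loss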

5. Experimental Results and Performance Comparison

Empirical evaluations demonstrate VFA's superior performance across datasets and modalities:

  • IXI T1w atlas→subject (NCC + diffusion, λ=1): VFA DSC 0.806±0.012, vs. Im2grid 0.792±0.012, TransMorph 0.774±0.029, VoxelMorph 0.726±0.048
  • T2w→T1w MR, inter-modal (MI, λ=0.2): VFA DSC 0.725±0.022, vs. DMR 0.671±0.038, TransMorph 0.660±0.044, Im2grid 0.668±0.025
  • Learn2Reg 2021 (OASIS, weakly supervised): VFA DSC 0.834, HD95 1.66 mm, SDLogJ 0.234; best DSC among all entrants
  • Learn2Reg 2022 lung CT (semi-supervised): VFA TRE 1.705 mm, TRE30 2.311 mm; among the top three, with the best TRE30 overall

Non-diffeomorphic voxels remain below 0.1% in intra-modal tasks.

6. Algorithmic Ablations and Modality-Agnostic Characteristics

VFA's design decouples feature extraction from spatial matching, yielding generality across imaging contrasts:

  • Intra-modal registration: Shared CNN weights; NCC loss
  • Inter-modal registration: Independent CNNs; MI loss

Ablations reveal:

  • Replacing dot-product similarity with cosine similarity enhances visual coherence in feature matching across modalities, with minimal effect on the Dice coefficient (sketched below).
  • VFA retains accuracy with a half-width U-Net, whereas baseline architectures do not match VFA even with increased capacity.

This suggests VFA's parameter-free attention can robustly integrate with feature extractors, and generalizes across domains without modality-specific tuning.
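For the cosine-similarity variant, a drop-in change to the Section 2 sketch is to L2-normalize the features along the channel dimension before matching (hypothetical usage; `local_attention_displacement`, `feat_f`, and `feat_m` refer to that earlier sketch):

import torch.nn.functional as F

# L2-normalizing queries and keys over channels turns the dot product inside
# the attention into a cosine similarity.
feat_f_cos = F.normalize(feat_f, dim=1)
feat_m_cos = F.normalize(feat_m, dim=1)
disp_cos = local_attention_displacement(feat_f_cos, feat_m_cos)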

7. Implementation Protocol

End-to-end differentiable computation enables direct integration with modern deep learning workflows. The following pseudocode summarizes the multi-scale inference and training loop for VFA:

Input: I_f, I_m
Params: CNN_f, CNN_m  (U-Nets), β, λ, other loss weights
for l = L down to 1 do
  # 1) extract or warp features
  F_f^l ← CNN_f features at level l from I_f
  if l == L:
    M_warp^l ← CNN_m features at level l from I_m
  else:
    M_warp^l ← warp(CNN_m^l(I_m), φ^{l+1})
  # 2) optional: conv to adjust channels
  F_f^l, M_warp^l ← Conv(F_f^l), Conv(M_warp^l)
  # 3) parameter-free attention matching at every x
  for each voxel x in F_f^l:
    Q = F_f^l(x)                          # 1×C
    collect K_δ = M_warp^l(x+δ) for δ ∈ {-1,0,1}^d
    compute scores s_δ = (Q·K_δ)/sqrt(C)
    A_δ = softmax_δ(s_δ)
    u^l(x) = Σ_δ A_δ·(−δ)                 # values R(δ) = −δ
  end
  φ^l(x) = x + β·u^l(x)
  if l < L: φ^l ← compose(φ^l, upsample(φ^{l+1}))
end
I_w = grid_sampler(I_m, φ^1)
Compute total loss L = Sim(I_f, I_w) + λ·Reg(u^1) [+ supervised losses]
Backprop and update CNN_f, CNN_m, β
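The pseudocode maps onto a single-level PyTorch training step roughly as follows; this simplified sketch reuses the hypothetical helpers from earlier sections (`FeatureExtractor`, `local_attention_displacement`, `warp`, `total_loss`), treats `MyUNet` and `loader` as placeholders, keeps β fixed, and omits the multi-scale composition.

import torch

extractor = FeatureExtractor(unet_factory=lambda: MyUNet(), shared=True)
optimizer = torch.optim.Adam(extractor.parameters(), lr=1e-4)

for I_f, I_m in loader:                        # (B, 1, H, W) image pairs
    feats_f, feats_m = extractor(I_f, I_m)
    # Finest-level displacement only (no coarse-to-fine refinement here).
    u1 = local_attention_displacement(feats_f[0], feats_m[0])
    B, _, H, W = u1.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    identity = torch.stack([xx, yy]).float().to(u1).expand(B, -1, -1, -1)
    phi1 = identity + u1                       # absolute coordinate map phi^1
    I_w = warp(I_m, phi1)                      # differentiable warping
    loss = total_loss(I_f, I_w, u1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()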

VFA's locality-constrained, modality-agnostic deformable attention paradigm extends to both supervised and unsupervised workflows and can be combined with arbitrary feature extractors for registration tasks in medical imaging and beyond.
