
AMD-HookNet++: Hybrid CNN-Transformer for Glacier Segmentation

Updated 23 December 2025
  • The paper introduces AMD-HookNet++, which fuses a Swin-UNet Transformer and a U-Net CNN to improve glacier segmentation in SAR imagery.
  • It employs an Enhanced Spatial-Channel Attention module for effective feature fusion, leading to superior IoU and HD95 metrics on the CaFFe benchmark.
  • The framework uses pixel-to-pixel contrastive deep supervision to stabilize training and boost accuracy and efficiency compared to prior models.

AMD-HookNet++ is a hybrid convolutional neural network (CNN)–Transformer framework designed for glacier segmentation and calving front delineation in synthetic aperture radar (SAR) imagery. Building on the two-branch “HookNet” paradigm originally developed for histopathology image segmentation, AMD-HookNet++ integrates Transformer-based global context modeling with CNN-driven local feature extraction. Its architecture introduces a feature fusion mechanism based on enhanced spatial-channel attention (ESCA) and employs pixel-to-pixel contrastive deep supervision, setting a new state of the art on the CaFFe glacier benchmark (Wu et al., 16 Dec 2025).

1. Principle and Model Architecture

AMD-HookNet++ generalizes the original two-branch HookNet design by pairing a Swin-UNet Transformer for context with a U-Net CNN for high-resolution target features. Each branch processes a different spatial resolution derived from the same 224×224×3 SAR input patch: the context branch receives a downsampled “coarse” view, and the target branch a center-cropped, higher-resolution patch at twice the context resolution ($r_t = 2 r_c$).
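For concreteness, the following minimal PyTorch sketch illustrates one way the paired views could be produced, assuming the coarse context view is a downsampled version of a larger SAR window and the target view is its full-resolution center crop; the window size and interpolation mode are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def make_context_target_pair(sar_window: torch.Tensor, patch: int = 224):
    """Build the coarse context view and the 2x-resolution target view
    from one SAR window of shape (C, H, W). Assumes H = W = 2 * patch,
    so that the target resolution is twice the context resolution."""
    c, h, w = sar_window.shape
    assert h == w == 2 * patch, "illustrative assumption: 448x448 window"

    # Context-branch input: the whole window, downsampled to 224x224.
    context = F.interpolate(sar_window.unsqueeze(0), size=(patch, patch),
                            mode="bilinear", align_corners=False).squeeze(0)

    # Target-branch input: the central 224x224 crop at full resolution.
    top, left = (h - patch) // 2, (w - patch) // 2
    target = sar_window[:, top:top + patch, left:left + patch]
    return context, target

ctx, tgt = make_context_target_pair(torch.randn(3, 448, 448))
print(ctx.shape, tgt.shape)  # torch.Size([3, 224, 224]) torch.Size([3, 224, 224])
```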

Transformer Context Branch (Swin-UNet):

  • Patch embedding via a $4 \times 4$ convolution to a $56 \times 56 \times 96$ token map.
  • Four Swin Transformer stages implementing alternating window-based (W-MSA) and shifted-window (SW-MSA) self-attention, with residuals and MLPs:

$$\hat{z}^l = \mathrm{W\text{-}MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \quad z^l = \mathrm{MLP}(\mathrm{LN}(\hat{z}^l)) + \hat{z}^l$$

  • Output reshaped to $224 \times 224 \times 4$ logits.
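A simplified PyTorch sketch of one such block is given below. It implements only the plain window-based attention step of the formula above; the shifted-window (SW-MSA) variant, relative position bias, and patch merging between stages are omitted, so this is an illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class WindowMSABlock(nn.Module):
    """Simplified Swin-style block: LN -> window-based multi-head
    self-attention -> residual, then LN -> MLP -> residual."""

    def __init__(self, dim: int = 96, num_heads: int = 3,
                 window: int = 7, mlp_ratio: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                        # x: (B, H, W, C), H and W divisible by window
        b, h, w, c = x.shape
        ws = self.window
        # Partition into non-overlapping ws x ws windows -> (B * nWindows, ws*ws, C).
        win = x.reshape(b, h // ws, ws, w // ws, ws, c)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
        # z_hat = W-MSA(LN(z)) + z
        y = self.norm1(win)
        win = win + self.attn(y, y, y)[0]
        # z = MLP(LN(z_hat)) + z_hat
        win = win + self.mlp(self.norm2(win))
        # Merge windows back to (B, H, W, C).
        win = win.reshape(b, h // ws, w // ws, ws, ws, c)
        return win.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)

# Applied to the 56x56x96 token map produced by the patch embedding above.
print(WindowMSABlock()(torch.randn(2, 56, 56, 96)).shape)  # torch.Size([2, 56, 56, 96])
```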

CNN Target Branch (U-Net):

  • Nine convolutional blocks: $3 \times 3$ Conv $\rightarrow$ BN $\rightarrow$ ReLU, with MaxPool downsampling in the encoder and ConvTranspose upsampling in the decoder.
  • Channel progression: $32 \rightarrow 64 \rightarrow 128 \rightarrow 256 \rightarrow 320 \rightarrow 256 \rightarrow 128 \rightarrow 64 \rightarrow 32$.
  • Final $1 \times 1$ convolution producing the four output semantic classes.
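The sketch below assembles a U-Net with exactly this channel progression, assuming standard skip-connection concatenations; the ESCA hook insertion points described next are omitted for clarity, so this is a layout illustration rather than the authors' code.

```python
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """3x3 Conv -> BN -> ReLU, the basic block named in the text."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TargetUNet(nn.Module):
    """CNN target branch sketch: nine conv blocks following the channel
    progression 32-64-128-256-320-256-128-64-32 and a final 1x1 classifier."""

    def __init__(self, in_ch: int = 3, n_classes: int = 4):
        super().__init__()
        widths = [32, 64, 128, 256, 320]
        self.enc = nn.ModuleList([conv_block(in_ch if i == 0 else widths[i - 1], w)
                                  for i, w in enumerate(widths)])
        self.pool = nn.MaxPool2d(2)
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        prev = widths[-1]
        for w in widths[-2::-1]:                       # 256, 128, 64, 32
            self.up.append(nn.ConvTranspose2d(prev, w, 2, stride=2))
            self.dec.append(conv_block(2 * w, w))      # concat with skip of width w
            prev = w
        self.head = nn.Conv2d(prev, n_classes, 1)      # 1x1 conv -> 4 classes

    def forward(self, x):
        skips = []
        for i, stage in enumerate(self.enc):
            x = stage(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = self.pool(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)

print(TargetUNet()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 4, 224, 224])
```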

Feature Exchange and Matching:

  • At two depths (Swin stages 6–7 and CNN conv-blocks 5–6), features are aligned using cropping or up/downsampling to ensure spatial correspondence before fusion via ESCA.
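A minimal sketch of this alignment step, under the assumption that the target patch corresponds to the central half of the context view (consistent with $r_t = 2 r_c$); the exact crop offsets and resampling mode are assumptions.

```python
import torch
import torch.nn.functional as F

def align_context_to_target(f_ctx: torch.Tensor, f_tgt: torch.Tensor) -> torch.Tensor:
    """Align a context-branch feature map (B, C, Hc, Wc) to a target-branch
    feature map (B, C', Ht, Wt) before ESCA fusion. Assumes the target patch
    covers the central half of the context view at twice the resolution."""
    _, _, hc, wc = f_ctx.shape
    _, _, ht, wt = f_tgt.shape
    # Central crop covering the target patch's footprint in the context view.
    crop = f_ctx[:, :, hc // 4: 3 * hc // 4, wc // 4: 3 * wc // 4]
    # Resample to the target feature resolution so the tensors can be fused.
    return F.interpolate(crop, size=(ht, wt), mode="bilinear", align_corners=False)

aligned = align_context_to_target(torch.randn(1, 96, 56, 56), torch.randn(1, 128, 56, 56))
print(aligned.shape)  # torch.Size([1, 96, 56, 56])
```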

2. Enhanced Spatial-Channel Attention (ESCA) Fusion

The ESCA module enables expressive and adaptive fusion of features between the Transformer and CNN branches beyond simple concatenation or single-head self-attention.

Spatial Attention:

  • Crop context features and concatenate with target features, yielding a tensor $M \in \mathbb{R}^{H' \times W' \times C''}$.
  • Channel-independent $q, k, v$ generated by depth-wise convolutions:

$$q, k, v = \mathrm{DwConv}(M)$$

  • Attention-weighted combination:

$$A_s = \mathrm{Softmax}(q k^\top / \sqrt{d})\, v, \quad \widetilde{M} = M + \theta A_s$$

with $\theta$ a learnable residual scaling parameter.

Channel Attention:

  • Reshape $\widetilde{M}$ to $H'W' \times C''$.
  • Learn a spatially independent affinity $U \in \mathbb{R}^{H'W' \times C''}$.
  • Channel-wise refinement: $U' = \mathrm{Softmax}(U, \mathrm{dim} = C'')$, $A_c = U' \times \widetilde{M}$.

Final Fusion:

  • A $1 \times 1$ convolution restores the hooked feature:

$$\mathrm{ESCA\_hook}(F_c, F_t) = \mathrm{Conv}_{1 \times 1}(A_c)$$

  • LayerNorm and ReLU are applied within the convolutional steps.
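The sketch below assembles these steps in PyTorch. The per-position prediction of the affinity $U$ with a $1 \times 1$ convolution, the element-wise channel refinement, the choice of $d$ as the channel dimension, and the placement of LayerNorm/ReLU are interpretations of the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESCAHook(nn.Module):
    """Sketch of the Enhanced Spatial-Channel Attention fusion described above."""

    def __init__(self, c_ctx: int, c_tgt: int, c_out: int):
        super().__init__()
        c = c_ctx + c_tgt                               # C'' after concatenation
        # Depth-wise convolutions producing channel-independent q, k, v.
        self.q = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.k = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.v = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.theta = nn.Parameter(torch.zeros(1))       # learnable residual scale
        # Assumption: the channel affinity U is predicted per position by a 1x1 conv.
        self.affinity = nn.Conv2d(c, c, 1)
        self.proj = nn.Conv2d(c, c_out, 1)              # restores the hooked feature
        self.norm = nn.LayerNorm(c_out)

    def forward(self, f_ctx, f_tgt):
        m = torch.cat([f_ctx, f_tgt], dim=1)            # M: (B, C'', H, W)
        b, c, h, w = m.shape
        # --- Spatial attention over the H*W positions ---
        q = self.q(m).flatten(2).transpose(1, 2)        # (B, N, C'')
        k = self.k(m).flatten(2).transpose(1, 2)
        v = self.v(m).flatten(2).transpose(1, 2)
        # Scaled dot-product attention; d taken as the channel dimension (assumption).
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        a_s = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        m_tilde = m + self.theta * a_s                  # M~ = M + theta * A_s
        # --- Channel attention: softmax of the affinity over channels ---
        u = torch.softmax(self.affinity(m_tilde), dim=1)
        a_c = u * m_tilde                               # assumed element-wise refinement
        # --- Final fusion: 1x1 conv, LayerNorm, ReLU ---
        out = self.proj(a_c)
        out = self.norm(out.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return F.relu(out)

hooked = ESCAHook(c_ctx=96, c_tgt=128, c_out=128)(torch.randn(1, 96, 56, 56),
                                                  torch.randn(1, 128, 56, 56))
print(hooked.shape)  # torch.Size([1, 128, 56, 56])
```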

ESCA consistently outperforms CBAM and single-head fusions by more effectively leveraging both spatial and channel interactions, yielding increased IoU (+0.6%) and improved Hausdorff distance compared to baseline fusions.

3. Pixel-to-Pixel Contrastive Deep Supervision

AMD-HookNet++ exploits deep supervision at intermediate decoder levels using a pixelwise contrastive (NCE-style) loss. Denoting the embedding of pixel $i$ simply by $i$, the contrastive loss at decoder depths $D = 1, 2$ is:

$$\mathcal{L}_i^{\mathrm{NCE}} = -\frac{1}{|\mathcal{P}_i|}\sum_{i^+\in\mathcal{P}_i} \log\frac{\exp(i\cdot i^+/\tau)}{\exp(i\cdot i^+/\tau) + \sum_{i^-\in\mathcal{N}_i} \exp(i\cdot i^-/\tau)}$$

where $\mathcal{P}_i$ and $\mathcal{N}_i$ are the positive and negative pixel sets for anchor $i$, and $\tau$ is a temperature parameter.
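A sketch of this loss in PyTorch is shown below. For tractability it subsamples pixels and, as in common supervised-contrastive implementations, lets the denominator run over all non-self pairs; the sampling strategy, temperature, and sample count are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def pixel_nce_loss(emb: torch.Tensor, labels: torch.Tensor,
                   tau: float = 0.1, n_samples: int = 256) -> torch.Tensor:
    """Pixel-to-pixel InfoNCE sketch. emb: (N, D) pixel embeddings from a
    decoder level, labels: (N,) class ids defining positives/negatives."""
    idx = torch.randperm(emb.size(0), device=emb.device)[:n_samples]  # subsample pixels
    z = F.normalize(emb[idx], dim=1)
    y = labels[idx]
    sim = z @ z.t() / tau                                # pairwise similarities i . j / tau
    eye = torch.eye(len(idx), dtype=torch.bool, device=emb.device)
    pos = (y[:, None] == y[None, :]) & ~eye              # P_i: same-class, non-self pixels
    logits = sim.masked_fill(eye, float('-inf'))         # exclude self-similarity
    # log( exp(s_ij) / sum over all non-self pairs exp(s_ik) )
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_log_prob = torch.where(pos, log_prob, torch.zeros_like(log_prob))
    n_pos = pos.sum(1)
    valid = n_pos > 0                                    # skip anchors with no positive
    return (-pos_log_prob.sum(1)[valid] / n_pos[valid]).mean()

loss = pixel_nce_loss(torch.randn(5000, 64), torch.randint(0, 4, (5000,)))
print(loss.item())
```

In practice, `emb` would be obtained by flattening a decoder feature map of shape (B, D, H, W) to (B·H·W, D) and pairing it with correspondingly downsampled labels.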

Total loss:

$$\mathcal{L} = \lambda_1 \left[\mathrm{CE}(p_t, y_t) + \mathrm{Dice}(p_t, y_t)\right] + \lambda_2 \left[\mathrm{CE}(p_c, y_c) + \mathrm{Dice}(p_c, y_c)\right] + \lambda_3 \mathcal{L}_{\mathrm{cds}}$$

with $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.5$. Here, $(p_t, y_t)$ and $(p_c, y_c)$ denote the logits/labels of the target and context outputs, respectively.
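The combined objective maps directly onto code, as sketched below; the soft multi-class Dice formulation shown is one common variant and is an assumption.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft multi-class Dice loss (one common formulation)."""
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, probs.size(1)).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

def total_loss(p_t, y_t, p_c, y_c, l_cds, lam=(1.0, 1.0, 0.5)):
    """Combined objective with lambda_1 = lambda_2 = 1 and lambda_3 = 0.5."""
    branch = lambda p, y: F.cross_entropy(p, y) + dice_loss(p, y)
    return lam[0] * branch(p_t, y_t) + lam[1] * branch(p_c, y_c) + lam[2] * l_cds

l = total_loss(torch.randn(2, 4, 224, 224), torch.randint(0, 4, (2, 224, 224)),
               torch.randn(2, 4, 224, 224), torch.randint(0, 4, (2, 224, 224)),
               torch.tensor(0.3))
print(l.item())
```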

Contrastive supervision encourages discriminative intermediate embeddings, empirically stabilizing training and reducing mean distance error (MDE) by approximately 13.6%.

4. Training Protocol and Implementation

Data and Preprocessing:

  • Dataset: CaFFe (681 SAR images, 7 glaciers spanning Greenland, Antarctic Peninsula, Alaska), split into 559 train and 122 test images. Annotations span “ocean + ice-melange”, “rock outcrop”, “glacier”, and “NA-area.”
  • Input: patches of $224 \times 224$; target patches are extracted with a sliding window, and context patches are obtained by downsampling.
  • Augmentation: random rotations in [0°, 360°], flips (p = 0.5), standard normalization.
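A sketch of this augmentation pipeline, assuming tensor inputs; whether flips are horizontal, vertical, or both, and the normalization statistics, are not specified above and are placeholders here.

```python
import torch
import torchvision.transforms as T

# Augmentation sketch for (C, H, W) image tensors.
augment = T.Compose([
    T.RandomRotation(degrees=(0, 360)),   # random rotation in [0°, 360°]
    T.RandomHorizontalFlip(p=0.5),        # flip choice is an assumption
    T.RandomVerticalFlip(p=0.5),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # placeholder statistics
])

patch = augment(torch.rand(3, 224, 224))
print(patch.shape)  # torch.Size([3, 224, 224])
```

For segmentation training, the same geometric transforms must also be applied to the label mask, e.g. via torchvision's functional API, so that image and mask stay aligned.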

Optimization:

  • Optimizer: SGD, momentum 0.9, weight decay $1 \times 10^{-4}$.
  • Learning rate: initial 0.01, exponential decay (0.9 per epoch).
  • 130 epochs, batch size 170.
  • Context branch (Swin-UNet) initialized from ImageNet-pretrained weights.
  • PyTorch implementation on a single NVIDIA A100 GPU.
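The stated optimization setup maps directly onto standard PyTorch components, as sketched below with a placeholder model and the data loop elided.

```python
import torch

model = torch.nn.Conv2d(3, 4, 1)          # placeholder module for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(130):                  # 130 epochs, batch size 170 in the paper
    # ... iterate over training batches, compute the total loss, call backward() ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                      # lr *= 0.9 once per epoch
```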

5. Empirical Performance

“Zones” Segmentation (mean ± std over five runs):

  • Precision: $87.9 \pm 0.2\%$
  • Recall: $86.3 \pm 0.5\%$
  • F1-score: $86.3 \pm 0.3\%$
  • IoU: $78.2 \pm 0.4\%$ (up 2.7 points over HookFormer, 3.8 over AMD-HookNet, and 8.5 over the CaFFe baseline)

Calving Front Delineation:

  • MDE: $367 \pm 30\,\mathrm{m}$ (on par with HookFormer’s $353 \pm 16\,\mathrm{m}$)
  • HD95: $1318 \pm 115\,\mathrm{m}$ (3.8% lower than HookFormer’s $1370 \pm 75\,\mathrm{m}$)
  • Zero missed-front cases (0/122 test images)
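For reference, the sketch below shows how MDE and HD95 could be computed from extracted front coordinates; the exact definitions used by the CaFFe benchmark (symmetrization, pixel-to-metre scaling) may differ in detail and are assumptions here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def front_distances(pred_front: np.ndarray, true_front: np.ndarray):
    """Delineation metrics sketch: inputs are (N, 2) arrays of front pixel
    coordinates, assumed already scaled to metres."""
    d = cdist(pred_front, true_front)               # pairwise distances
    d_pred = d.min(axis=1)                          # predicted pixel -> nearest true pixel
    d_true = d.min(axis=0)                          # true pixel -> nearest predicted pixel
    mde = 0.5 * (d_pred.mean() + d_true.mean())     # symmetric mean distance error
    hd95 = max(np.percentile(d_pred, 95),           # 95th-percentile Hausdorff distance
               np.percentile(d_true, 95))
    return mde, hd95

mde, hd95 = front_distances(np.random.rand(100, 2) * 1000, np.random.rand(120, 2) * 1000)
print(round(mde, 1), round(hd95, 1))
```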

Summary Table: Calving Front Metrics

| Method | IoU (%) | MDE (m) | HD95 (m) |
|---|---|---|---|
| CaFFe baseline | 69.7 | 753 | 2180 |
| AMD-HookNet (CNN) | 74.4 | 438 | 1631 |
| HookFormer (ViT) | 75.5 | 353 | 1370 |
| Trans-UNet (hybrid) | 66.0 | 574 | 1836 |
| AMD-HookNet++ | 78.2 | 367 | 1318 |

Qualitatively, AMD-HookNet++ produces smoother and more geophysically plausible calving front delineations than pure-Transformer methods, which are prone to jittery, jagged boundaries in noisy SAR imagery.

6. Analysis and Limitations

The combination of Transformer branch (global context) and CNN branch (local continuity) accounts for the significant segmentation improvements. ESCA avoids uniform attention limitations of CBAM or single-head self-attention by decoupling spatial and channel recalibration, and contributes to superior IoU and HD95.

Contrastive deep supervision ensures intermediate pixel embeddings are more class-discriminative, improving both MDE and training stability.

Despite incurring greater computational cost (22 GFLOPs vs. 15 GFLOPs for HookFormer), AMD-HookNet++ achieves higher throughput (165 img/s vs. 91 img/s) due to efficient CNN decoding. The primary limitation is reliance on ImageNet pretraining, suggesting a need for large-scale, domain-specific foundation models for SAR applications.

7. Extensions and Prospective Directions

Prospective research avenues include:

  • Extending ESCA fusion architecture to other dual-branch domains (e.g., multi-modal fusion in remote sensing).
  • Pretraining on large-scale unlabeled SAR corpora using self-supervised learning objectives (e.g., SSL4SAR).
  • Adapting AMD-HookNet++ for multi-class glacier zone segmentation and processing of three-dimensional radar echo data.

These developments indicate the method’s broad relevance for SAR-based environmental monitoring and geophysical analysis, beyond glacier segmentation (Wu et al., 16 Dec 2025).
