AMD-HookNet++: Hybrid CNN-Transformer for Glacier Segmentation
- The paper introduces AMD-HookNet++, which fuses a Swin-UNet Transformer and a U-Net CNN to improve glacier segmentation in SAR imagery.
- It employs an Enhanced Spatial-Channel Attention module for effective feature fusion, leading to superior IoU and HD95 metrics on the CaFFe benchmark.
- The framework uses pixel-to-pixel contrastive deep supervision to stabilize training and boost accuracy and efficiency compared to prior models.
AMD-HookNet++ is a hybrid convolutional neural network (CNN)–Transformer framework designed for glacier segmentation and calving front delineation in synthetic aperture radar (SAR) imagery. Building on the two-branch “HookNet” paradigm originally developed for histopathology image segmentation, AMD-HookNet++ integrates Transformer-based global context modeling with CNN-driven local feature extraction. Its architecture introduces a feature fusion mechanism based on enhanced spatial-channel attention (ESCA) and employs pixel-to-pixel contrastive deep supervision, setting a new state of the art on the CaFFe glacier benchmark (Wu et al., 16 Dec 2025).
1. Principle and Model Architecture
AMD-HookNet++ generalizes the original two-branch HookNet design by pairing a Swin-UNet Transformer for context with a U-Net CNN for high-resolution target features. Each branch processes a different spatial resolution derived from the same 224×224×3 SAR input patch: the context branch receives a downsampled “coarse” view, and the target branch a center-cropped, higher-resolution patch at twice the context resolution.
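As a minimal illustration of this pairing, the following PyTorch sketch (a hypothetical helper, not from the paper) derives a matched context/target pair from one SAR scene, assuming the chosen center lies far enough from the scene borders:

```python
import torch
import torch.nn.functional as F

def make_patch_pair(scene: torch.Tensor, center: tuple, patch: int = 224):
    """Derive a matched (context, target) patch pair from a SAR scene.

    scene:  (C, H, W) tensor; center: (row, col) of the region of interest.
    The context view covers a 2x larger footprint, downsampled back to
    `patch` pixels; the target view is the central crop at full resolution.
    """
    r, c = center
    half_t, half_c = patch // 2, patch  # context footprint is twice as wide
    # Target branch input: high-resolution central crop (patch x patch).
    target = scene[:, r - half_t:r + half_t, c - half_t:c + half_t]
    # Context branch input: 2x footprint, downsampled to patch x patch.
    context = scene[:, r - half_c:r + half_c, c - half_c:c + half_c]
    context = F.interpolate(context.unsqueeze(0), size=(patch, patch),
                            mode="bilinear", align_corners=False).squeeze(0)
    return context, target
```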
Transformer Context Branch (Swin-UNet):
- Patch embedding via a strided convolution to an $\tfrac{H}{4}\times\tfrac{W}{4}\times C$ token grid ($56\times 56$ tokens for the $224\times224$ input).
- Four Swin Transformer stages implementing alternating window-based (W-MSA) and shifted-window (SW-MSA) self-attention, with residual connections and MLPs:

$$\hat{z}^{l} = \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \qquad z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l},$$

$$\hat{z}^{l+1} = \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \qquad z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.$$

- Output reshaped to $224 \times 224 \times 4$ class logits (a block-level sketch follows this list).
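A self-contained sketch of one such block is given below. It follows the pre-norm equations above but, for brevity, omits the attention mask and relative position bias of the real Swin implementation; window partitioning is done with reshapes and the shift with `torch.roll`:

```python
import torch
import torch.nn as nn

class SwinBlock(nn.Module):
    """One (S)W-MSA block matching the pre-norm equations above (simplified)."""
    def __init__(self, dim: int, heads: int, window: int = 7, shifted: bool = False):
        super().__init__()
        self.w = window
        self.s = window // 2 if shifted else 0  # SW-MSA uses a half-window shift
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) token grid; H and W must be divisible by the window.
        B, H, W, C = x.shape
        w, s = self.w, self.s
        if s:  # shifted windows: roll the grid, attend, roll back afterwards
            x = torch.roll(x, (-s, -s), dims=(1, 2))
        # Partition into non-overlapping w x w windows.
        win = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, w * w, C)
        h = self.norm1(win)
        win = win + self.attn(h, h, h, need_weights=False)[0]  # (S)W-MSA + residual
        win = win + self.mlp(self.norm2(win))                  # MLP + residual
        # Undo the window partition.
        x = win.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, H, W, C)
        return torch.roll(x, (s, s), dims=(1, 2)) if s else x

# 56x56 token grid with C = 96, as produced by the patch embedding above.
tokens = torch.randn(1, 56, 56, 96)
out = SwinBlock(dim=96, heads=3, shifted=True)(tokens)
```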
CNN Target Branch (U-Net):
- Nine convolutional blocks, each Conv → BN → ReLU, with MaxPool downsampling (encoder) or ConvTranspose upsampling (decoder).
- Channel width follows the usual U-Net pattern, doubling at each encoder stage and halving at each decoder stage.
- A final $1\times1$ convolution produces the four output semantic classes (see the sketch after this list).
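A sketch of this block pattern (kernel size 3 and the single-conv layout per block are assumptions):

```python
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """One Conv -> BN -> ReLU unit of the U-Net target branch."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# Encoder step: conv block then 2x2 max-pooling (downsampling).
encoder_step = nn.Sequential(conv_block(64, 128), nn.MaxPool2d(2))
# Decoder step: transposed-conv upsampling then a conv block.
decoder_step = nn.Sequential(nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),
                             conv_block(64, 64))
# Final 1x1 convolution to the four semantic classes.
head = nn.Conv2d(64, 4, kernel_size=1)
```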
Feature Exchange and Matching:
- At two depths (Swin stages 6–7 and CNN conv-blocks 5–6), features are aligned using cropping or up/downsampling to ensure spatial correspondence before fusion via ESCA (sketched below).
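A sketch of the alignment step; the crop fraction reflects the 2:1 resolution ratio between the branches, and the exact offsets in the paper may differ:

```python
import torch
import torch.nn.functional as F

def align_for_hooking(ctx_feat: torch.Tensor, tgt_feat: torch.Tensor):
    """Spatially align context features with target features before ESCA.

    ctx_feat: (B, C, Hc, Wc) from the Swin context branch.
    tgt_feat: (B, C, Ht, Wt) from the U-Net target branch.
    """
    _, _, Hc, Wc = ctx_feat.shape
    _, _, Ht, Wt = tgt_feat.shape
    # The target patch covers the central half of the context footprint,
    # so crop the central region of the context features...
    top, left = Hc // 4, Wc // 4
    ctx_crop = ctx_feat[:, :, top:top + Hc // 2, left:left + Wc // 2]
    # ...and resample so both maps share the target's spatial grid.
    ctx_aligned = F.interpolate(ctx_crop, size=(Ht, Wt),
                                mode="bilinear", align_corners=False)
    return ctx_aligned, tgt_feat
```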
2. Enhanced Spatial-Channel Attention (ESCA) Fusion
The ESCA module enables expressive and adaptive fusion of features between the Transformer and CNN branches beyond simple concatenation or single-head self-attention.
Spatial Attention:
- Crop the context features and concatenate them with the target features, yielding a fused tensor $F$.
- Channel-independent spatial attention maps $A_s$ are generated by depth-wise convolutions over $F$.
- Attention-weighted combination, with a learnable residual scale $\gamma$:

$$F_s = F + \gamma\,(A_s \odot F).$$
Channel Attention:
- Reshape $F_s$ to $\mathbb{R}^{C \times N}$, where $N = HW$ is the number of spatial positions.
- Learn a spatially independent channel affinity matrix $A_c \in \mathbb{R}^{C \times C}$.
- Channelwise refinement, with a learnable residual scale $\beta$:

$$F_c = F_s + \beta\,(A_c F_s).$$
Final Fusion:
- A $1\times1$ convolution restores the hooked feature to the target branch's channel width:

$$F_{\text{hook}} = \mathrm{Conv}_{1\times 1}(F_c).$$

- LayerNorm and ReLU are applied within the convolutional steps (see the module sketch below).
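Putting the three steps together, a compact PyTorch sketch of the fusion pattern. The depth-wise kernel size, sigmoid gating, Gram-matrix channel affinity, and zero-initialized residual scales are assumptions, and LayerNorm is omitted for brevity; the paper's exact ESCA layout may differ:

```python
import torch
import torch.nn as nn

class ESCA(nn.Module):
    """Enhanced spatial-channel attention fusion (illustrative sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        c2 = 2 * channels  # concatenated context + target features
        # Depth-wise conv generates channel-independent spatial attention.
        self.dw = nn.Conv2d(c2, c2, kernel_size=7, padding=3, groups=c2)
        self.gamma = nn.Parameter(torch.zeros(1))  # spatial residual scale
        self.beta = nn.Parameter(torch.zeros(1))   # channel residual scale
        self.proj = nn.Conv2d(c2, channels, kernel_size=1)  # restore width
        self.act = nn.ReLU(inplace=True)

    def forward(self, ctx_aligned: torch.Tensor, tgt: torch.Tensor):
        f = torch.cat([ctx_aligned, tgt], dim=1)        # F: (B, 2C, H, W)
        # Spatial attention: F_s = F + gamma * (A_s ⊙ F)
        a_s = torch.sigmoid(self.dw(f))
        f_s = f + self.gamma * (a_s * f)
        # Channel attention over the (C x N) reshaped features.
        b, c, h, w = f_s.shape
        flat = f_s.flatten(2)                           # (B, C, N)
        a_c = torch.softmax(flat @ flat.transpose(1, 2) / (h * w), dim=-1)
        f_c = f_s + self.beta * (a_c @ flat).reshape(b, c, h, w)
        # 1x1 conv restores the hooked feature width for the target branch.
        return self.act(self.proj(f_c))
```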
ESCA consistently outperforms CBAM and single-head self-attention fusion by more effectively leveraging both spatial and channel interactions, yielding higher IoU (+0.6%) and a lower Hausdorff distance than these baseline fusions.
3. Pixel-to-Pixel Contrastive Deep Supervision
AMD-HookNet++ exploits deep supervision at intermediate decoder levels using a pixelwise contrastive (NCE-style) loss. Denoting pixel $i$'s embedding as $z_i$, the contrastive loss at decoder depth $d$ is

$$\mathcal{L}_{\text{con}}^{(d)} = -\sum_{i} \frac{1}{|P_i|} \sum_{p \in P_i} \log \frac{\exp(z_i \cdot z_p / \tau)}{\exp(z_i \cdot z_p / \tau) + \sum_{n \in N_i} \exp(z_i \cdot z_n / \tau)},$$

where $P_i$/$N_i$ are the positive/negative pixel sets and $\tau$ is a temperature parameter.

Total loss:

$$\mathcal{L} = \mathcal{L}_{\text{CE}}(\hat{y}_t, y_t) + \lambda_1\, \mathcal{L}_{\text{CE}}(\hat{y}_c, y_c) + \lambda_2 \sum_{d} \mathcal{L}_{\text{con}}^{(d)},$$

with weighting coefficients $\lambda_1, \lambda_2$. Here, $(\hat{y}_t, y_t)$ and $(\hat{y}_c, y_c)$ denote logits/labels for the target and context outputs, respectively (a code sketch of the contrastive term follows).
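A sketch of the pixelwise contrastive term under common simplifying choices: random pixel subsampling, temperature 0.1, and a denominator over all other sampled pixels (as in supervised contrastive learning). The paper's exact sampling scheme is not reproduced here:

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(emb: torch.Tensor, labels: torch.Tensor,
                           tau: float = 0.1, n_pix: int = 256) -> torch.Tensor:
    """Pixel-to-pixel contrastive loss on intermediate decoder embeddings.

    emb:    (B, D, H, W) embeddings at one decoder depth.
    labels: (B, H, W) integer class map, downsampled to match emb.
    """
    B, D, H, W = emb.shape
    n = min(n_pix, H * W)
    z = F.normalize(emb.flatten(2).transpose(1, 2), dim=-1)   # (B, N, D)
    y = labels.flatten(1)                                      # (B, N)
    # Subsample pixels so the similarity matrix stays small.
    idx = torch.randperm(H * W, device=emb.device)[:n]
    z, y = z[:, idx], y[:, idx]
    sim = z @ z.transpose(1, 2) / tau                          # (B, n, n)
    eye = torch.eye(n, dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(eye, float("-inf"))                  # drop self-pairs
    pos = y.unsqueeze(2).eq(y.unsqueeze(1)) & ~eye             # same-class pairs
    logp = sim.log_softmax(dim=-1)
    # Average log-probability of the positives for each anchor pixel.
    loss = -torch.where(pos, logp, torch.zeros_like(logp)).sum(-1)
    return (loss / pos.sum(-1).clamp(min=1)).mean()
```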
Contrastive supervision encourages discriminative intermediate embeddings, empirically stabilizing training and reducing mean distance error (MDE) by approximately 13.6%.
4. Training Protocol and Implementation
Data and Preprocessing:
- Dataset: CaFFe (681 SAR images, 7 glaciers spanning Greenland, Antarctic Peninsula, Alaska), split into 559 train and 122 test images. Annotations span “ocean + ice-melange”, “rock outcrop”, “glacier”, and “NA-area.”
- Input: Patches of $224\times224$ pixels; target patches are extracted with a sliding window, context patches by downsampling.
- Augmentation: Random rotations in [0°, 360°], flips (p = 0.5), standard normalization.
Optimization:
- Optimizer: SGD with momentum 0.9 and weight decay.
- Learning rate: 0.01 initially, with exponential decay (×0.9 per epoch).
- 130 epochs, batch size 170.
- Context branch (Swin-UNet) initialized from ImageNet-pretrained weights.
- PyTorch implementation on a single NVIDIA A100 GPU (optimizer setup sketched after this list).
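The optimizer and schedule translate directly to PyTorch; in this sketch the weight-decay value (not given above) and the stand-in model are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 4, kernel_size=1)  # stand-in for AMD-HookNet++
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            weight_decay=1e-4)  # decay value assumed
# Exponential decay: the learning rate is multiplied by 0.9 every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(130):
    # ... one pass over the 559 training images in batches of 170 ...
    scheduler.step()
```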
5. Empirical Performance
“Zones” Segmentation (mean ± std over five runs):
- Per-class precision, recall, and F1-score are reported for all compared models.
- IoU: 78.2% (up 2.7% vs. HookFormer, 3.8% vs. AMD-HookNet, 8.5% vs. the CaFFe baseline).
Calving Front Delineation:
- MDE: 367 m (on par with HookFormer's 353 m).
- HD95: 1318 m (down 3.8% vs. HookFormer's 1370 m).
- Zero missed-front cases (0/122 test images); a metric-computation sketch follows.
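For reference, one common way to compute the two delineation metrics from predicted and ground-truth front pixel coordinates (a sketch; the benchmark's exact implementation may differ):

```python
import numpy as np

def front_metrics(pred_xy: np.ndarray, true_xy: np.ndarray):
    """MDE and HD95 between two fronts given as (N, 2) / (M, 2) coordinate
    arrays, already scaled to metres by the SAR ground resolution."""
    # Pairwise Euclidean distances between predicted and true front points.
    d = np.linalg.norm(pred_xy[:, None, :] - true_xy[None, :, :], axis=-1)
    d_pt = d.min(axis=1)  # nearest true point for each predicted point
    d_tp = d.min(axis=0)  # nearest predicted point for each true point
    mde = 0.5 * (d_pt.mean() + d_tp.mean())             # symmetric mean error
    hd95 = max(np.percentile(d_pt, 95), np.percentile(d_tp, 95))
    return mde, hd95
```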
Summary Table: Calving Front Metrics
| Method | IoU (%) | MDE (m) | HD95 (m) |
|---|---|---|---|
| CaFFe baseline | 69.7 | 753 | 2180 |
| AMD-HookNet (CNN) | 74.4 | 438 | 1631 |
| HookFormer (ViT) | 75.5 | 353 | 1370 |
| Trans-UNet (hybrid) | 66.0 | 574 | 1836 |
| AMD-HookNet++ | 78.2 | 367 | 1318 |
Qualitatively, AMD-HookNet++ produces smoother and more geophysically plausible calving front delineations than pure-Transformer methods, which are prone to jittery, jagged boundaries in noisy SAR imagery.
6. Analysis and Limitations
The combination of the Transformer branch (global context) and the CNN branch (local boundary continuity) accounts for the segmentation improvements. By decoupling spatial and channel recalibration, ESCA avoids the uniform-attention limitations of CBAM and single-head self-attention, contributing to the gains in IoU and HD95.
Contrastive deep supervision ensures intermediate pixel embeddings are more class-discriminative, improving both MDE and training stability.
Despite incurring greater computational cost (22 GFLOPs vs. 15 GFLOPs for HookFormer), AMD-HookNet++ achieves higher throughput (165 img/s vs. 91 img/s) due to efficient CNN decoding. The primary limitation is reliance on ImageNet pretraining, suggesting a need for large-scale, domain-specific foundation models for SAR applications.
7. Extensions and Prospective Directions
Prospective research avenues include:
- Extending ESCA fusion architecture to other dual-branch domains (e.g., multi-modal fusion in remote sensing).
- Pretraining on large-scale unlabeled SAR corpora using self-supervised learning objectives (e.g., SSL4SAR).
- Adapting AMD-HookNet++ for multi-class glacier zone segmentation and processing of three-dimensional radar echo data.
These developments indicate the method’s broad relevance for SAR-based environmental monitoring and geophysical analysis, beyond glacier segmentation (Wu et al., 16 Dec 2025).