AMD-HookNet++: Hybrid CNN-Transformer for Glacier Segmentation
- The paper introduces AMD-HookNet++, which fuses a Swin-UNet Transformer and a U-Net CNN to improve glacier segmentation in SAR imagery.
- It employs an Enhanced Spatial-Channel Attention module for effective feature fusion, leading to superior IoU and HD95 metrics on the CaFFe benchmark.
- The framework uses pixel-to-pixel contrastive deep supervision to stabilize training and boost accuracy and efficiency compared to prior models.
AMD-HookNet++ is a hybrid convolutional neural network (CNN)–Transformer framework designed for glacier segmentation and calving front delineation in synthetic aperture radar (SAR) imagery. Building on the two-branch “HookNet” paradigm originally developed for histopathology image segmentation, AMD-HookNet++ integrates Transformer-based global context modeling with CNN-driven local feature extraction. Its architecture introduces a feature fusion mechanism based on enhanced spatial-channel attention (ESCA) and employs pixel-to-pixel contrastive deep supervision, setting a new state of the art on the CaFFe glacier benchmark (Wu et al., 16 Dec 2025).
1. Principle and Model Architecture
AMD-HookNet++ generalizes the original two-branch HookNet design by pairing a Swin-UNet Transformer for context with a U-Net CNN for high-resolution target features. Each branch processes a different spatial resolution derived from the same 224×224×3 SAR input patch: the context branch receives a downsampled “coarse” view, and the target branch a center-cropped, higher-resolution patch at twice the context resolution.
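As a minimal illustration of this pairing, the following PyTorch sketch (a hypothetical helper, not from the paper) derives a matched context/target pair from one SAR scene, assuming the chosen center lies far enough from the scene borders:

```python
import torch
import torch.nn.functional as F

def make_patch_pair(scene: torch.Tensor, center: tuple, patch: int = 224):
    """Derive a matched (context, target) patch pair from a SAR scene.

    scene:  (C, H, W) tensor; center: (row, col) of the region of interest.
    The context view covers a 2x larger footprint, downsampled back to
    `patch` pixels; the target view is the central crop at full resolution.
    """
    r, c = center
    half_t, half_c = patch // 2, patch  # context footprint is twice as wide
    # Target branch input: high-resolution central crop (patch x patch).
    target = scene[:, r - half_t:r + half_t, c - half_t:c + half_t]
    # Context branch input: 2x footprint, downsampled to patch x patch.
    context = scene[:, r - half_c:r + half_c, c - half_c:c + half_c]
    context = F.interpolate(context.unsqueeze(0), size=(patch, patch),
                            mode="bilinear", align_corners=False).squeeze(0)
    return context, target
```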
Transformer Context Branch (Swin-UNet):
- Patch embedding via a strided convolution to an $\tfrac{H}{4}\times\tfrac{W}{4}\times C$ token grid ($56\times 56$ tokens for the $224\times224$ input).
- Four Swin Transformer stages implementing alternating window-based (W-MSA) and shifted-window (SW-MSA) self-attention, with residual connections and MLPs:

$$\hat{z}^{l} = \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \qquad z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l},$$

$$\hat{z}^{l+1} = \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \qquad z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.$$

- Output reshaped to $224 \times 224 \times 4$ class logits (a block-level sketch follows this list).
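A self-contained sketch of one such block is given below. It follows the pre-norm equations above but, for brevity, omits the attention mask and relative position bias of the real Swin implementation; window partitioning is done with reshapes and the shift with `torch.roll`:

```python
import torch
import torch.nn as nn

class SwinBlock(nn.Module):
    """One (S)W-MSA block matching the pre-norm equations above (simplified)."""
    def __init__(self, dim: int, heads: int, window: int = 7, shifted: bool = False):
        super().__init__()
        self.w = window
        self.s = window // 2 if shifted else 0  # SW-MSA uses a half-window shift
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) token grid; H and W must be divisible by the window.
        B, H, W, C = x.shape
        w, s = self.w, self.s
        if s:  # shifted windows: roll the grid, attend, roll back afterwards
            x = torch.roll(x, (-s, -s), dims=(1, 2))
        # Partition into non-overlapping w x w windows.
        win = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, w * w, C)
        h = self.norm1(win)
        win = win + self.attn(h, h, h, need_weights=False)[0]  # (S)W-MSA + residual
        win = win + self.mlp(self.norm2(win))                  # MLP + residual
        # Undo the window partition.
        x = win.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, H, W, C)
        return torch.roll(x, (s, s), dims=(1, 2)) if s else x

# 56x56 token grid with C = 96, as produced by the patch embedding above.
tokens = torch.randn(1, 56, 56, 96)
out = SwinBlock(dim=96, heads=3, shifted=True)(tokens)
```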
CNN Target Branch (U-Net):
- Nine convolutional blocks, each Conv → BN → ReLU, with MaxPool downsampling (encoder) or ConvTranspose upsampling (decoder).
- Channel width follows the usual U-Net pattern, doubling at each encoder stage and halving at each decoder stage.
- A final $1\times1$ convolution produces the four output semantic classes (see the sketch after this list).
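A sketch of this block pattern (kernel size 3 and the single-conv layout per block are assumptions):

```python
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """One Conv -> BN -> ReLU unit of the U-Net target branch."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# Encoder step: conv block then 2x2 max-pooling (downsampling).
encoder_step = nn.Sequential(conv_block(64, 128), nn.MaxPool2d(2))
# Decoder step: transposed-conv upsampling then a conv block.
decoder_step = nn.Sequential(nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),
                             conv_block(64, 64))
# Final 1x1 convolution to the four semantic classes.
head = nn.Conv2d(64, 4, kernel_size=1)
```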
Feature Exchange and Matching:
- At two depths (Swin stages 6–7 and CNN conv-blocks 5–6), features are aligned using cropping or up/downsampling to ensure spatial correspondence before fusion via ESCA (sketched below).
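A sketch of the alignment step; the crop fraction reflects the 2:1 resolution ratio between the branches, and the exact offsets in the paper may differ:

```python
import torch
import torch.nn.functional as F

def align_for_hooking(ctx_feat: torch.Tensor, tgt_feat: torch.Tensor):
    """Spatially align context features with target features before ESCA.

    ctx_feat: (B, C, Hc, Wc) from the Swin context branch.
    tgt_feat: (B, C, Ht, Wt) from the U-Net target branch.
    """
    _, _, Hc, Wc = ctx_feat.shape
    _, _, Ht, Wt = tgt_feat.shape
    # The target patch covers the central half of the context footprint,
    # so crop the central region of the context features...
    top, left = Hc // 4, Wc // 4
    ctx_crop = ctx_feat[:, :, top:top + Hc // 2, left:left + Wc // 2]
    # ...and resample so both maps share the target's spatial grid.
    ctx_aligned = F.interpolate(ctx_crop, size=(Ht, Wt),
                                mode="bilinear", align_corners=False)
    return ctx_aligned, tgt_feat
```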
2. Enhanced Spatial-Channel Attention (ESCA) Fusion
The ESCA module enables expressive and adaptive fusion of features between the Transformer and CNN branches beyond simple concatenation or single-head self-attention.
Spatial Attention:
- Crop the context features and concatenate them with the target features, yielding a fused tensor $F$.
- Channel-independent spatial attention maps $A_s$ are generated by depth-wise convolutions over $F$.
- Attention-weighted combination, with a learnable residual scale $\gamma$:

$$F_s = F + \gamma\,(A_s \odot F).$$
Channel Attention:
- Reshape $F_s$ to $\mathbb{R}^{C \times N}$, where $N = HW$ is the number of spatial positions.
- Learn a spatially independent channel affinity matrix $A_c \in \mathbb{R}^{C \times C}$.
- Channelwise refinement, with a learnable residual scale $\beta$:

$$F_c = F_s + \beta\,(A_c F_s).$$
Final Fusion:
- A $1\times1$ convolution restores the hooked feature to the target branch's channel width:

$$F_{\text{hook}} = \mathrm{Conv}_{1\times 1}(F_c).$$

- LayerNorm and ReLU are applied within the convolutional steps (see the module sketch below).
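Putting the three steps together, a compact PyTorch sketch of the fusion pattern. The depth-wise kernel size, sigmoid gating, Gram-matrix channel affinity, and zero-initialized residual scales are assumptions, and LayerNorm is omitted for brevity; the paper's exact ESCA layout may differ:

```python
import torch
import torch.nn as nn

class ESCA(nn.Module):
    """Enhanced spatial-channel attention fusion (illustrative sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        c2 = 2 * channels  # concatenated context + target features
        # Depth-wise conv generates channel-independent spatial attention.
        self.dw = nn.Conv2d(c2, c2, kernel_size=7, padding=3, groups=c2)
        self.gamma = nn.Parameter(torch.zeros(1))  # spatial residual scale
        self.beta = nn.Parameter(torch.zeros(1))   # channel residual scale
        self.proj = nn.Conv2d(c2, channels, kernel_size=1)  # restore width
        self.act = nn.ReLU(inplace=True)

    def forward(self, ctx_aligned: torch.Tensor, tgt: torch.Tensor):
        f = torch.cat([ctx_aligned, tgt], dim=1)        # F: (B, 2C, H, W)
        # Spatial attention: F_s = F + gamma * (A_s ⊙ F)
        a_s = torch.sigmoid(self.dw(f))
        f_s = f + self.gamma * (a_s * f)
        # Channel attention over the (C x N) reshaped features.
        b, c, h, w = f_s.shape
        flat = f_s.flatten(2)                           # (B, C, N)
        a_c = torch.softmax(flat @ flat.transpose(1, 2) / (h * w), dim=-1)
        f_c = f_s + self.beta * (a_c @ flat).reshape(b, c, h, w)
        # 1x1 conv restores the hooked feature width for the target branch.
        return self.act(self.proj(f_c))
```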
ESCA consistently outperforms CBAM and single-head self-attention fusion by more effectively leveraging both spatial and channel interactions, yielding higher IoU (+0.6%) and a lower Hausdorff distance than these baseline fusions.
3. Pixel-to-Pixel Contrastive Deep Supervision
AMD-HookNet++ exploits deep supervision at intermediate decoder levels using a pixelwise contrastive (NCE-style) loss. Denoting pixel $i$'s embedding as $z_i$, the contrastive loss at decoder depth $d$ is

$$\mathcal{L}_{\text{con}}^{(d)} = -\sum_{i} \frac{1}{|P_i|} \sum_{p \in P_i} \log \frac{\exp(z_i \cdot z_p / \tau)}{\exp(z_i \cdot z_p / \tau) + \sum_{n \in N_i} \exp(z_i \cdot z_n / \tau)},$$

where $P_i$/$N_i$ are the positive/negative pixel sets and $\tau$ is a temperature parameter.

Total loss:

$$\mathcal{L} = \mathcal{L}_{\text{CE}}(\hat{y}_t, y_t) + \lambda_1\, \mathcal{L}_{\text{CE}}(\hat{y}_c, y_c) + \lambda_2 \sum_{d} \mathcal{L}_{\text{con}}^{(d)},$$

with weighting coefficients $\lambda_1, \lambda_2$. Here, $(\hat{y}_t, y_t)$ and $(\hat{y}_c, y_c)$ denote logits/labels for the target and context outputs, respectively (a code sketch of the contrastive term follows).
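A sketch of the pixelwise contrastive term under common simplifying choices: random pixel subsampling, temperature 0.1, and a denominator over all other sampled pixels (as in supervised contrastive learning). The paper's exact sampling scheme is not reproduced here:

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(emb: torch.Tensor, labels: torch.Tensor,
                           tau: float = 0.1, n_pix: int = 256) -> torch.Tensor:
    """Pixel-to-pixel contrastive loss on intermediate decoder embeddings.

    emb:    (B, D, H, W) embeddings at one decoder depth.
    labels: (B, H, W) integer class map, downsampled to match emb.
    """
    B, D, H, W = emb.shape
    n = min(n_pix, H * W)
    z = F.normalize(emb.flatten(2).transpose(1, 2), dim=-1)   # (B, N, D)
    y = labels.flatten(1)                                      # (B, N)
    # Subsample pixels so the similarity matrix stays small.
    idx = torch.randperm(H * W, device=emb.device)[:n]
    z, y = z[:, idx], y[:, idx]
    sim = z @ z.transpose(1, 2) / tau                          # (B, n, n)
    eye = torch.eye(n, dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(eye, float("-inf"))                  # drop self-pairs
    pos = y.unsqueeze(2).eq(y.unsqueeze(1)) & ~eye             # same-class pairs
    logp = sim.log_softmax(dim=-1)
    # Average log-probability of the positives for each anchor pixel.
    loss = -torch.where(pos, logp, torch.zeros_like(logp)).sum(-1)
    return (loss / pos.sum(-1).clamp(min=1)).mean()
```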
Contrastive supervision encourages discriminative intermediate embeddings, empirically stabilizing training and reducing mean distance error (MDE) by approximately 13.6%.
4. Training Protocol and Implementation
Data and Preprocessing:
- Dataset: CaFFe (681 SAR images, 7 glaciers spanning Greenland, Antarctic Peninsula, Alaska), split into 559 train and 122 test images. Annotations span “ocean + ice-melange”, “rock outcrop”, “glacier”, and “NA-area.”
- Input: Patches of $224\times224$ pixels; target patches are extracted with a sliding window, context patches by downsampling.
- Augmentation: Random rotations in [0°, 360°], flips (p = 0.5), standard normalization.
Optimization:
- Optimizer: SGD with momentum 0.9 and weight decay.
- Learning rate: 0.01 initially, with exponential decay (×0.9 per epoch).
- 130 epochs, batch size 170.
- Context branch (Swin-UNet) initialized from ImageNet-pretrained weights.
- PyTorch implementation on a single NVIDIA A100 GPU (optimizer setup sketched after this list).
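The optimizer and schedule translate directly to PyTorch; in this sketch the weight-decay value (not given above) and the stand-in model are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 4, kernel_size=1)  # stand-in for AMD-HookNet++
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            weight_decay=1e-4)  # decay value assumed
# Exponential decay: the learning rate is multiplied by 0.9 every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(130):
    # ... one pass over the 559 training images in batches of 170 ...
    scheduler.step()
```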
5. Empirical Performance
“Zones” Segmentation (mean ± std over five runs):
- Per-class precision, recall, and F1-score are reported for all compared models.
- IoU: 78.2% (up 2.7% vs. HookFormer, 3.8% vs. AMD-HookNet, 8.5% vs. the CaFFe baseline).
Calving Front Delineation:
- MDE: 367 m (on par with HookFormer's 353 m).
- HD95: 1318 m (down 3.8% vs. HookFormer's 1370 m).
- Zero missed-front cases (0/122 test images); a metric-computation sketch follows.
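For reference, one common way to compute the two delineation metrics from predicted and ground-truth front pixel coordinates (a sketch; the benchmark's exact implementation may differ):

```python
import numpy as np

def front_metrics(pred_xy: np.ndarray, true_xy: np.ndarray):
    """MDE and HD95 between two fronts given as (N, 2) / (M, 2) coordinate
    arrays, already scaled to metres by the SAR ground resolution."""
    # Pairwise Euclidean distances between predicted and true front points.
    d = np.linalg.norm(pred_xy[:, None, :] - true_xy[None, :, :], axis=-1)
    d_pt = d.min(axis=1)  # nearest true point for each predicted point
    d_tp = d.min(axis=0)  # nearest predicted point for each true point
    mde = 0.5 * (d_pt.mean() + d_tp.mean())             # symmetric mean error
    hd95 = max(np.percentile(d_pt, 95), np.percentile(d_tp, 95))
    return mde, hd95
```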
Summary Table: Calving Front Metrics
| Method | IoU (%) | MDE (m) | HD95 (m) |
|---|---|---|---|
| CaFFe baseline | 69.7 | 753 | 2180 |
| AMD-HookNet (CNN) | 74.4 | 438 | 1631 |
| HookFormer (ViT) | 75.5 | 353 | 1370 |
| Trans-UNet (hybrid) | 66.0 | 574 | 1836 |
| AMD-HookNet++ | 78.2 | 367 | 1318 |
Qualitatively, AMD-HookNet++ produces smoother and more geophysically plausible calving front delineations than pure-Transformer methods, which are prone to jittery, jagged boundaries in noisy SAR imagery.
6. Analysis and Limitations
The combination of the Transformer branch (global context) and the CNN branch (local boundary continuity) accounts for the segmentation improvements. By decoupling spatial and channel recalibration, ESCA avoids the uniform-attention limitations of CBAM and single-head self-attention, contributing to the gains in IoU and HD95.
Contrastive deep supervision ensures intermediate pixel embeddings are more class-discriminative, improving both MDE and training stability.
Despite incurring greater computational cost (22 GFLOPs vs. 15 GFLOPs for HookFormer), AMD-HookNet++ achieves higher throughput (165 img/s vs. 91 img/s) due to efficient CNN decoding. The primary limitation is reliance on ImageNet pretraining, suggesting a need for large-scale, domain-specific foundation models for SAR applications.
7. Extensions and Prospective Directions
Prospective research avenues include:
- Extending ESCA fusion architecture to other dual-branch domains (e.g., multi-modal fusion in remote sensing).
- Pretraining on large-scale unlabeled SAR corpora using self-supervised learning objectives (e.g., SSL4SAR).
- Adapting AMD-HookNet++ for multi-class glacier zone segmentation and processing of three-dimensional radar echo data.
These developments indicate the method’s broad relevance for SAR-based environmental monitoring and geophysical analysis, beyond glacier segmentation (Wu et al., 16 Dec 2025).