Siamese Swin Transformer with Dual Cross-Attention
- SSDCA is demonstrated to significantly improve balanced accuracy and sensitivity for tumor regrowth classification through direct cross-modal fusion.
- SSDCA employs a Siamese architecture with dual cross-attention that aligns features without explicit registration, enabling robust integration of pre- and post-treatment images or PET/CT volumes.
- Empirical results reveal enhanced segmentation and classification performance, with increased Dice scores and discriminative feature representations in complex imaging environments.
The Siamese Swin Transformer with Dual Cross-Attention (SSDCA) is a neural architecture that couples twin Swin Transformer backbones with a dual cross-attention (DCA) module, enabling direct, context-aware fusion of information from paired images or volumes. First developed for longitudinal tumor regrowth assessment in rectal endoscopy and for multimodal tumor segmentation in head-and-neck oncology, SSDCA emphasizes correspondences across time or modality, removing the need for explicit image registration while enhancing both discriminative representation and resilience to severe imaging variability (Gomez et al., 3 Dec 2025, Li et al., 2023).
1. Architectural Foundation
SSDCA employs a two-branch ("Siamese") topology in which each branch processes a distinct input, such as pre- and post-treatment endoscopic images (Gomez et al., 3 Dec 2025) or CT and PET image volumes (Li et al., 2023). Each branch leverages a hierarchical Swin Transformer backbone, characterized by window-based multi-head self-attention (W-MSA), patch embedding, and progressive spatial downsampling. For 2D tasks (rectal endoscopy), the shared backbone follows the standard Swin-Small configuration:
- Input resolution: 224 × 224
- Patch size: 4 × 4
- Window size: 7 × 7
- 4 hierarchical stages with patch merging, halving spatial resolution at each stage
For 3D segmentation (head-and-neck PET/CT), each branch begins with a convolutional patch embedder, followed by 4 Swin stages with 3D windowed attention and patch merging. In both settings, dual-branch encoding yields parallel deep feature representations suitable for cross-contextual fusion via the DCA mechanism.
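The two-branch wiring amounts to running a single weight-shared backbone on both inputs. The following PyTorch sketch illustrates this under the assumption that the backbone returns per-stage feature maps; `SiameseEncoder` is an illustrative name, not code from either paper.

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """Runs two inputs through one weight-shared hierarchical backbone."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # A single module instance means both branches share all weights.
        self.backbone = backbone

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # The backbone is assumed to return a list of per-stage feature
        # maps, deepest last (stage 4 for a 4-stage Swin).
        feats_a = self.backbone(x_a)  # e.g., pre-treatment image or CT
        feats_b = self.backbone(x_b)  # e.g., post-treatment image or PET
        return feats_a, feats_b
```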
2. Dual Cross-Attention Mechanism
The DCA module is applied to the stage-4 output feature maps (or at multiple Swin stages for 3D segmentation), producing cross-attended features that encode mutual context between the two inputs. At the feature fusion point:
- Let $X_1, X_2 \in \mathbb{R}^{N \times d}$ denote the flattened features from each branch.
- Learned projections generate query, key, and value matrices $Q_i, K_i, V_i$ per branch and per attention head.

The DCA computes bidirectional cross-attention as

$$\mathrm{CA}(X_1 \to X_2) = \mathrm{softmax}\!\left(\frac{Q_1 K_2^{\top}}{\sqrt{d_k}}\right) V_2, \qquad \mathrm{CA}(X_2 \to X_1) = \mathrm{softmax}\!\left(\frac{Q_2 K_1^{\top}}{\sqrt{d_k}}\right) V_1.$$

Each cross-attended feature is added residually and layer-normalized:

$$\tilde{X}_1 = \mathrm{LN}\big(X_1 + \mathrm{CA}(X_1 \to X_2)\big), \qquad \tilde{X}_2 = \mathrm{LN}\big(X_2 + \mathrm{CA}(X_2 \to X_1)\big).$$

This bidirectional linking enables the model to focus attention on relevant regions in either input without explicit alignment (Gomez et al., 3 Dec 2025, Li et al., 2023).
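A compact PyTorch sketch of this mechanism, built on `torch.nn.MultiheadAttention`, follows; the `DualCrossAttention` name and head count are placeholders rather than the papers' exact implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # x_a, x_b: (batch, tokens, dim) flattened stage features.
        # Branch A attends to branch B (Q from A; K, V from B) ...
        ctx_a, _ = self.attn_ab(query=x_a, key=x_b, value=x_b)
        # ... and branch B attends to branch A.
        ctx_b, _ = self.attn_ba(query=x_b, key=x_a, value=x_a)
        # Residual addition followed by layer normalization.
        return self.norm_a(x_a + ctx_a), self.norm_b(x_b + ctx_b)
```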
3. Feature Aggregation and Prediction
For classification, each DCA-refined feature map undergoes global average pooling (GAP) to yield pooled vectors $g_1, g_2 \in \mathbb{R}^{d}$, which are concatenated into a single vector $g = [g_1; g_2] \in \mathbb{R}^{2d}$. This vector is fed into a classification head

$$\hat{y} = \sigma(W g + b),$$

where $W \in \mathbb{R}^{1 \times 2d}$, $b \in \mathbb{R}$, and $\sigma$ denotes the sigmoid activation, yielding the class probability.
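A minimal sketch of this pooling-and-prediction path, assuming the single linear layer and sigmoid described above (`FusionClassifier` is an illustrative name):

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 1)  # W in R^{1 x 2d}, b in R

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Global average pooling over the token dimension.
        g_a = x_a.mean(dim=1)             # (batch, dim)
        g_b = x_b.mean(dim=1)             # (batch, dim)
        g = torch.cat([g_a, g_b], dim=1)  # (batch, 2*dim)
        return torch.sigmoid(self.fc(g))  # class probability
```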
For segmentation tasks, multi-scale DCA fusion features are passed through a U-Net-like convolutional decoder, incorporating skip connections and upsampling layers to reconstruct the spatial segmentation map (Li et al., 2023).
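A single stage of such a decoder might look like the sketch below, assuming fused DCA features at each scale serve as skip connections; the channel counts, normalization, and layer choices are assumptions rather than the exact decoder of Li et al. (2023).

```python
import torch
import torch.nn as nn

class DecoderBlock3D(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        # Transposed convolution doubles each spatial dimension.
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)                   # upsample 2x
        x = torch.cat([x, skip], dim=1)  # fuse the multi-scale DCA feature
        return self.conv(x)
```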
4. Training Protocols and Data Regimes
Key aspects include:
- Optimization: Adam for classification (batch size 8) (Gomez et al., 3 Dec 2025); AdamW for segmentation (batch size 1, polynomial learning-rate decay) (Li et al., 2023).
- Loss: Binary cross-entropy for classification; composite Dice + cross-entropy for segmentation.
- Augmentation: Extensive geometric and intensity-based augmentations are employed (rotations, flips for 2D; 3D rotations, scaling, mirroring, noise, gamma for 3D).
- Evaluation: Five-fold cross-validation with stratified splits. Classification uses balanced sampling per mini-batch (Gomez et al., 3 Dec 2025); a sketch of these training components follows below.
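The classification-side training components named above (Adam, binary cross-entropy, balanced mini-batch sampling) can be assembled roughly as in this sketch; the dataset handling, learning rate, and all names are placeholder assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels, batch_size=8):
    # Inverse-frequency sample weights give both classes equal
    # expected representation in each mini-batch.
    labels = torch.as_tensor(labels)
    counts = torch.bincount(labels)
    weights = (1.0 / counts.float())[labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

def train_step(model, batch, optimizer, criterion=nn.BCELoss()):
    # One optimization step with binary cross-entropy on the
    # sigmoid class probability produced by the fusion head.
    img_pre, img_post, y = batch
    optimizer.zero_grad()
    prob = model(img_pre, img_post).squeeze(1)
    loss = criterion(prob, y.float())
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters())  # lr as reported in the paper
```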
Datasets
- Rectal endoscopy: 135 training patients (7,392 pairs), 62 held-out test patients (368 pairs) (Gomez et al., 3 Dec 2025).
- Head-and-neck segmentation: 224 patients (HECKTOR 2021), with 5-fold CV (Li et al., 2023).
5. Empirical Performance and Ablations
Classification (Rectal Tumor Regrowth)
SSDCA achieves superior aggregate results compared with Swin-Single-Image (Swin-S SI) and simple stage-4 feature concatenation (SSFC):
| Model | Balanced Acc. | Sensitivity | Specificity |
|---|---|---|---|
| Swin-S SI | 76.24% ± 0.02 | 65.32% ± 0.09 | 87.14% ± 0.07 |
| SSFC (concat) | 81.13% ± 0.06 | 84.00% ± 0.13 | 78.57% ± 0.12 |
| SSDCA (DCA) | 81.76% ± 0.05 | 90.07% ± 0.08 | 72.86% ± 0.05 |
SSDCA achieves a statistically significant improvement in balanced accuracy (p < 0.01) and the highest sensitivity among the compared models.
Segmentation (Head-and-Neck PET/CT)
| Model | 5-fold Dice |
|---|---|
| UNETR (PET+CT) | 0.723 ± 0.035 |
| Swin UNETR (PET+CT) | 0.754 ± 0.032 |
| nnU-Net (PET+CT) | 0.767 ± 0.029 |
| SSDCA (DCA) | 0.769 ± 0.026 |
Ablation studies reveal that replacing cross-attention with simple channel-wise concatenation yields Dice = 0.754, while integrating dual DCA at each Swin stage raises performance to Dice = 0.769 (+0.015). In the 2D use case, applying DCA at earlier stages provided no further benefit (Gomez et al., 3 Dec 2025), whereas in 3D segmentation multi-stage fusion is standard.
Robustness and Representation
On endoscopic images affected by artifacts (blood, stool, telangiectasia, poor quality), SSDCA maintains stable sensitivity and specificity, with only marginal drops for stool and poor-quality occlusions (Gomez et al., 3 Dec 2025).
Cluster analysis using UMAP demonstrates that SSDCA yields the largest inter-cluster separation (1.45 ± 0.18) and the smallest intra-cluster dispersion (1.07 ± 0.19) of GAP feature embeddings among the compared models, indicative of superior discriminative structure.
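As an illustration, such statistics could be computed as below, assuming binary labels, 2D UMAP projections of the GAP embeddings, centroid distance as inter-cluster separation, and mean distance-to-centroid as intra-cluster dispersion; the papers' exact metric definitions may differ.

```python
import numpy as np
import umap  # umap-learn package

def cluster_stats(features: np.ndarray, labels: np.ndarray):
    # Project GAP feature embeddings to 2D with UMAP.
    emb = umap.UMAP(n_components=2).fit_transform(features)
    classes = np.unique(labels)  # assumes exactly two classes
    centroids = {c: emb[labels == c].mean(axis=0) for c in classes}
    inter = np.linalg.norm(centroids[classes[0]] - centroids[classes[1]])
    intra = np.mean([
        np.linalg.norm(emb[labels == c] - centroids[c], axis=1).mean()
        for c in classes
    ])
    return inter, intra  # larger inter / smaller intra => better separation
```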
6. Context and Relevance
SSDCA's dual cross-attention mechanism draws direct ancestry from prior cross-modal Swin architectures such as SwinCross, which established the integration of cross-attention at multiple hierarchical Swin stages for PET/CT fusion (Li et al., 2023). The primary innovation in SSDCA is its adaptation to the Siamese setting for temporal or paired analysis, leveraging DCA to robustly match feature correspondences across spatially unaligned scans or disparate modalities.
Use of pretrained Swin backbones enables SSDCA to extract domain-agnostic features, enhancing resistance to imaging noise and variability (Gomez et al., 3 Dec 2025). Its explicit contextual fusion at multiple resolutions is demonstrated to benefit both large- and small-scale lesion detection.
A plausible implication is broad applicability to diverse multi-temporal or multi-modal scenarios beyond endoscopic cancer monitoring and PET/CT segmentation.
7. Broader Implications and Directions
SSDCA's results suggest a systematic strategy for dual-branch Transformer fusion, effectively capturing cross-input dependencies without spatial registration—a feature essential for clinical domains with high intra- and inter-patient variability. Its competitive performance and domain-adaptive robustness indicate strong promise for future multimodal and longitudinal image analysis tasks.
Further extensions could address scalability to larger spatial and temporal contexts, integration with clinical metadata, or hybridization with attention-based meta-learning frameworks. The evolution of DCA modules may include adaptive weighting or self-supervised attention guidance to further enhance feature discrimination and generalization across domains (Gomez et al., 3 Dec 2025, Li et al., 2023).