Swin-BART Encoder-Decoder for Medical Captioning
- The paper demonstrates the effectiveness of a unified Swin-BART system with regional attention, achieving a ROUGE of 0.603 and BERTScore of 0.807 on the ROCO dataset.
- The system employs hierarchical window-based self-attention in the vision encoder and domain-specific PubMedBERT embeddings in the BART decoder to enhance diagnostic relevance.
- Ablation studies confirm that disabling the regional attention module significantly reduces performance, underscoring its critical role in maintaining clinical fidelity.
The Swin-BART encoder-decoder system is a unified framework for automated medical image captioning designed to translate complex radiological images into concise, diagnostically meaningful narratives. The system integrates a hierarchical Swin Transformer vision encoder, a BART-base text decoder enhanced with domain-specific embeddings, and a lightweight regional attention module that selectively amplifies salient regions in the visual input prior to cross-modal fusion. Trained and evaluated on the ROCO dataset, this system achieves state-of-the-art semantic fidelity and interpretability for clinical captioning tasks in CT, MRI, and X-ray modalities.
1. System Architecture
1.1 Swin Transformer Encoder
Input images are partitioned into non-overlapping patches. Each patch is linearly embedded, resulting in tokens of dimension 96. The encoder comprises a four-stage hierarchical backbone, where each stage alternates Windowed Multi-Head Self-Attention (W-MSA) and Shifted-Window MSA (SW-MSA) over fixed local windows, followed by a patch-merging step that downsamples spatial resolution by a factor of 2 in each dimension and doubles the channel dimension. Each transformer block includes LayerNorm and a position-wise two-layer feed-forward network (FFN) with GELU activations.
After the final stage, the encoder produces a spatial feature tensor $F \in \mathbb{R}^{h \times w \times 768}$ (the channel width follows from three doublings of the initial 96), which is flattened across the spatial dimensions to yield $F \in \mathbb{R}^{N \times 768}$ with $N = h \cdot w$.
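Assuming a 224×224 input and a 4×4 patch embedding (typical Swin defaults, not stated explicitly above), the stage-wise token shapes follow mechanically from the halving/doubling rule:

```python
# Sketch (not the authors' code): token-grid and channel progression
# through a 4-stage Swin backbone with embed dim 96. Patch merging
# halves each spatial dimension and doubles channels between stages.
def swin_stage_shapes(img=224, patch=4, embed_dim=96, stages=4):
    h = w = img // patch          # initial token grid (56 x 56)
    c = embed_dim
    shapes = [(h, w, c)]
    for _ in range(stages - 1):   # three patch-merging transitions
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes

print(swin_stage_shapes())
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

The final stage lands at a 7×7 grid of 768-dimensional tokens, consistent with the flattened $N \times 768$ feature tensor described above.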
1.2 BART Decoder and Biomedical Embeddings
The decoder adopts BART-base (12 layers) and generates caption tokens auto-regressively. Each layer employs masked self-attention over the partially generated caption and cross-attention to the encoder features. The decoder leverages 768-dimensional PubMedBERT embeddings instead of BART's default token embeddings to inject biomedical knowledge. These embeddings are frozen for the first epoch and then jointly fine-tuned with the decoder weights.
Multi-head attention follows the standard scaled dot-product formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

where, for cross-attention, $Q$ derives from decoder states while $K$ and $V$ derive from encoder outputs.
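The cross-attention computation can be sketched in NumPy as follows (shapes are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: decoder states (T_dec x d); K, V: encoder outputs (N x d)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T_dec x N)
    return softmax(scores, axis=-1) @ V  # (T_dec x d)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))   # 5 decoder positions
K = rng.normal(size=(49, 64))  # 49 encoder tokens
V = rng.normal(size=(49, 64))
out = cross_attention(Q, K, V)
print(out.shape)  # (5, 64)
```

Each decoder position attends over all encoder tokens, producing one mixed feature vector per output position.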
2. Regional Attention Module
The regional attention module is interposed between the Swin encoder output and the projection stage. It operates as follows:
- Importance Scoring: compute region-wise importance weights using a linear layer and softmax:
$$\alpha = \mathrm{softmax}\!\left(F W_s + b_s\right),$$
with $W_s \in \mathbb{R}^{768 \times 1}$ and $\alpha \in \mathbb{R}^{N}$.
- Feature Aggregation: produce attended feature representations:
$$\tilde{F} = \alpha \odot F,$$
where $\odot$ broadcasts each region's scalar weight across its channels, so $\tilde{F} \in \mathbb{R}^{N \times 768}$.
- Dimensionality Projection and Pooling:
$$Z = \tilde{F} W_p + b_p,$$
with $W_p \in \mathbb{R}^{768 \times d}$. Adaptive average pooling over the token axis yields the final $K$ encoder tokens $Z' \in \mathbb{R}^{K \times d}$.
Salient regions with high importance weights $\alpha_i$ are emphasized, guiding decoder cross-attention toward diagnostically relevant features.
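A minimal NumPy sketch of the three steps above, with toy dimensions (the projection width `d` and pooled token count `k` are illustrative choices, not values from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def regional_attention(F, W_s, b_s, W_p, b_p, k):
    """F: (N x C) flattened encoder features.
    1) importance scores alpha over the N regions,
    2) reweight each region's features by its score,
    3) project to the decoder width,
    4) adaptive average pooling down to k tokens."""
    alpha = softmax((F @ W_s + b_s).squeeze(-1), axis=0)  # (N,)
    F_att = alpha[:, None] * F                            # (N x C)
    Z = F_att @ W_p + b_p                                 # (N x d)
    # adaptive average pooling N -> k along the token axis
    idx = np.linspace(0, Z.shape[0], k + 1).astype(int)
    pooled = np.stack([Z[idx[i]:idx[i + 1]].mean(axis=0) for i in range(k)])
    return pooled, alpha

rng = np.random.default_rng(1)
N, C, d, k = 49, 16, 8, 4   # toy sizes; the real encoder uses C = 768
F = rng.normal(size=(N, C))
tokens, alpha = regional_attention(F, rng.normal(size=(C, 1)), 0.0,
                                   rng.normal(size=(C, d)), 0.0, k)
print(tokens.shape, round(alpha.sum(), 6))  # (4, 8) 1.0
```

The softmax guarantees the region weights sum to one, so the module reallocates (rather than adds) attention mass across regions before the decoder sees them.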
3. Learning Protocol and Objective
Training minimizes the standard cross-entropy loss over the predicted caption sequence:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t}, x\right)$$
The AdamW optimizer is employed with a fixed learning rate, weight decay $0.01$, and dropout $0.1$ on projection layers. The schedule consists of 5 epochs with early stopping (patience = 3) and a batch size of 8, with fixed random seeds (seed 42; three independent runs are reported).
No auxiliary losses are incorporated beyond the main objective, and model selection is based on mean ± std performance with confidence intervals.
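A minimal sketch of the teacher-forced cross-entropy objective above (NumPy, toy vocabulary; not the authors' code):

```python
import numpy as np

def caption_nll(logits, targets):
    """Teacher-forced negative log-likelihood: -sum_t log p(y_t | y_<t, x).
    logits: (T x V) decoder outputs per time step; targets: (T,) gold ids."""
    # numerically stable log-softmax over the vocabulary axis
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

# Uniform logits over a 5-token vocabulary for 3 steps: each step
# contributes -log(1/5), so the total is 3 * ln 5.
print(round(caption_nll(np.zeros((3, 5)), np.array([1, 2, 3])), 4))  # 4.8283
```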
4. Inference and Decoding Mechanism
Inference utilizes beam search with a beam size of 4, a length penalty of $1.1$, and a maximum output length of 128 tokens. This configuration balances hypothesis diversity against verbatim repetition and discourages overly terse captioning.
A plausible implication is that these decoding constraints align the outputs with clinical reporting style while preventing generic or repetitive phrasing.
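If the decoder is driven through a Hugging Face-style `generate()` interface (an assumption; the source does not name its implementation), the stated configuration maps onto the following keyword arguments:

```python
# Decoding configuration as described in the text; the keyword names
# follow the Hugging Face `generate()` convention and are assumed here.
generation_kwargs = {
    "num_beams": 4,         # beam size
    "length_penalty": 1.1,  # >1 mildly favors longer hypotheses
    "max_length": 128,      # maximum output tokens
    # The text also mentions repetition avoidance; a common knob is
    # `no_repeat_ngram_size`, left unset since its value is not given.
}
# caption_ids = model.generate(pixel_values, **generation_kwargs)  # hypothetical call
print(generation_kwargs["num_beams"], generation_kwargs["length_penalty"])
```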
5. Quantitative Evaluation Metrics
Performance is benchmarked on the ROCO test split against BLIP2-OPT and ResNet-CNN baselines. All results are mean ± std over three seeds; bracketed ranges are confidence intervals.
| Metric | BLIP2-OPT | ResNet-CNN | Proposed (Swin–BART + RA) |
|---|---|---|---|
| ROUGE | 0.255±0.006 | 0.356±0.005 | 0.603±0.004 [0.595,0.611] |
| BLEU | 0.217±0.009 | 0.311±0.007 | 0.257±0.008 [0.241,0.273] |
| CIDEr | 0.231±0.007 | 0.296±0.006 | 0.215±0.005 [0.205,0.225] |
| METEOR | 0.092±0.002 | 0.084±0.003 | 0.081±0.002 [0.077,0.085] |
| BERTScore | 0.645±0.005 | 0.623±0.006 | 0.807±0.003 [0.801,0.813] |
The Swin-BART system achieves a substantial relative gain in ROUGE (roughly 69% over the strongest baseline) and a large absolute gain in BERTScore (+0.162). BLEU, CIDEr, and METEOR are comparable to, and in some cases below, the baselines; the combination of high ROUGE and BERTScore nonetheless suggests strong semantic fidelity.
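The headline gains can be checked directly from the table:

```python
# Arithmetic check of the reported gains (values from the table above).
rouge_gain = (0.603 - 0.356) / 0.356   # relative gain over ResNet-CNN (best baseline ROUGE)
bert_gain = 0.807 - 0.645              # absolute gain over BLIP2-OPT (best baseline BERTScore)
print(round(rouge_gain, 3), round(bert_gain, 3))  # 0.694 0.162
```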
6. Ablation Studies and Analysis
6.1 Regional Attention Effect
Disabling the regional attention module reduces ROUGE from 0.603 to 0.538 and BERTScore from 0.807 to 0.762 (statistically significant; see Section 6.4), confirming its critical role in guiding clinically relevant captioning.
6.2 Token Count Sweep
Varying the number of pooled encoder tokens shows peak ROUGE (0.603) at the selected setting, with only marginal drops at adjacent values. This suggests a trade-off between representation granularity and over-smoothing.
6.3 Modality-Specific Results
| Modality | ROUGE | BERTScore |
|---|---|---|
| CT | 0.615±0.005 | 0.814±0.004 |
| MRI | 0.590±0.006 | 0.798±0.005 |
| X-ray | 0.605±0.004 | 0.804±0.003 |
Captioning performance is consistent and high across modalities, with CT yielding the strongest scores.
6.4 Statistical Significance
A paired bootstrap analysis (n = 1,000 resamples) confirms that the gains from regional attention over the no-attention variant are statistically significant across all metrics.
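A paired bootstrap test of this kind can be sketched as follows (caption-level metric scores for the two systems are assumed as inputs; this is not the authors' code):

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_boot=1000, seed=42):
    """Paired bootstrap significance test: resample caption-level score
    pairs with replacement and count how often system A fails to beat
    system B on the resampled mean. Returns the estimated one-sided
    p-value for the comparison A > B."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    wins = 0
    for _ in range(n_boot):
        sample = diffs[rng.integers(0, n, size=n)]  # resample pairs
        if sample.mean() > 0:
            wins += 1
    return 1.0 - wins / n_boot

# Toy usage: system A is uniformly better, so the p-value is tiny.
p = paired_bootstrap([0.60] * 100, [0.50] * 100)
print(p)  # 0.0
```

Pairing the scores per caption removes example-level difficulty as a confound, which is why this test is standard for comparing two captioning systems on the same test set.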
6.5 Qualitative Visualization
Heatmaps overlaying the region-wise attention scores on the original images indicate strong alignment between model focus and annotated pathological regions in clinical reports, reinforcing interpretability and facilitating human-in-the-loop validation.
7. Interpretation and Implications
The Swin-BART encoder-decoder architecture leverages hierarchical windowed vision, biomedical-aware textual decoding, and targeted regional attention to maximize both semantic accuracy and transparency in medical image captioning. The compact design and robust attribution maps suggest practical applicability for clinical report triage, decision support, and safe integration with human oversight in diagnostic pipelines.
A plausible implication is that lightweight regional attention modules, when coupled with domain-adapted token embeddings, provide an effective and interpretable route to clinical fidelity in multimodal generative modeling. Further, the model’s evaluation regime—spanning thorough metric reporting, ablation, and modality breakdown—supports its reliability and potential for deployment in research-centric medical imaging environments.