Swin-BART Encoder-Decoder for Medical Captioning
- The paper demonstrates the effectiveness of a unified Swin-BART system with regional attention, achieving a ROUGE of 0.603 and BERTScore of 0.807 on the ROCO dataset.
- The system employs hierarchical window-based self-attention in the vision encoder and domain-specific PubMedBERT embeddings in the BART decoder to enhance diagnostic relevance.
- Ablation studies confirm that disabling the regional attention module significantly reduces performance, underscoring its critical role in maintaining clinical fidelity.
The Swin-BART encoder-decoder system is a unified framework for automated medical image captioning designed to translate complex radiological images into concise, diagnostically meaningful narratives. The system integrates a hierarchical Swin Transformer vision encoder, a BART-base text decoder enhanced with domain-specific embeddings, and a lightweight regional attention module that selectively amplifies salient regions in the visual input prior to cross-modal fusion. Trained and evaluated on the ROCO dataset, this system achieves state-of-the-art semantic fidelity and interpretability for clinical captioning tasks in CT, MRI, and X-ray modalities.
1. System Architecture
1.1 Swin Transformer Encoder
Input images are partitioned into non-overlapping patches. Each patch is linearly embedded, resulting in tokens of dimension 96. The encoder comprises a four-stage hierarchical backbone, where each stage alternates Windowed Multi-Head Self-Attention (W-MSA) and Shifted-Window MSA (SW-MSA) over fixed local windows, followed by a patch-merging step that downsamples spatial resolution by a factor of 2 in each dimension and doubles the channel dimension. Each transformer block includes LayerNorm and a position-wise two-layer feed-forward network (FFN) with GELU activations.
After the final stage, the encoder produces a spatial feature tensor $F \in \mathbb{R}^{h \times w \times 768}$ (the channel width follows from three doublings of the initial 96), which is flattened across the spatial dimensions to yield $F \in \mathbb{R}^{N \times 768}$ with $N = h \cdot w$.
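Assuming a 224×224 input and a 4×4 patch embedding (typical Swin defaults, not stated explicitly above), the stage-wise token shapes follow mechanically from the halving/doubling rule:

```python
# Sketch (not the authors' code): token-grid and channel progression
# through a 4-stage Swin backbone with embed dim 96. Patch merging
# halves each spatial dimension and doubles channels between stages.
def swin_stage_shapes(img=224, patch=4, embed_dim=96, stages=4):
    h = w = img // patch          # initial token grid (56 x 56)
    c = embed_dim
    shapes = [(h, w, c)]
    for _ in range(stages - 1):   # three patch-merging transitions
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes

print(swin_stage_shapes())
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

The final stage lands at a 7×7 grid of 768-dimensional tokens, consistent with the flattened $N \times 768$ feature tensor described above.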
1.2 BART Decoder and Biomedical Embeddings
The decoder adopts BART-base (12 layers) and generates caption tokens auto-regressively. Each layer employs masked self-attention over the partially generated caption and cross-attention to the encoder features. The decoder leverages 768-dimensional PubMedBERT embeddings instead of BART's default token embeddings to inject biomedical knowledge. These embeddings are frozen for the first epoch and then jointly fine-tuned with the decoder weights.
Multi-head attention follows the standard scaled dot-product formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

where, for cross-attention, $Q$ derives from decoder states while $K$ and $V$ derive from encoder outputs.
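The cross-attention computation can be sketched in NumPy as follows (shapes are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: decoder states (T_dec x d); K, V: encoder outputs (N x d)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T_dec x N)
    return softmax(scores, axis=-1) @ V  # (T_dec x d)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))   # 5 decoder positions
K = rng.normal(size=(49, 64))  # 49 encoder tokens
V = rng.normal(size=(49, 64))
out = cross_attention(Q, K, V)
print(out.shape)  # (5, 64)
```

Each decoder position attends over all encoder tokens, producing one mixed feature vector per output position.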
2. Regional Attention Module
The regional attention module is interposed between the Swin encoder output and the projection stage. It operates as follows:
- Importance Scoring: compute region-wise importance weights using a linear layer and softmax:
$$\alpha = \mathrm{softmax}\!\left(F W_s + b_s\right),$$
with $W_s \in \mathbb{R}^{768 \times 1}$ and $\alpha \in \mathbb{R}^{N}$.
- Feature Aggregation: produce attended feature representations:
$$\tilde{F} = \alpha \odot F,$$
where $\odot$ broadcasts each region's scalar weight across its channels, so $\tilde{F} \in \mathbb{R}^{N \times 768}$.
- Dimensionality Projection and Pooling:
$$Z = \tilde{F} W_p + b_p,$$
with $W_p \in \mathbb{R}^{768 \times d}$. Adaptive average pooling over the token axis yields the final $K$ encoder tokens $Z' \in \mathbb{R}^{K \times d}$.
Salient regions with high importance weights $\alpha_i$ are emphasized, guiding decoder cross-attention toward diagnostically relevant features.
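A minimal NumPy sketch of the three steps above, with toy dimensions (the projection width `d` and pooled token count `k` are illustrative choices, not values from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def regional_attention(F, W_s, b_s, W_p, b_p, k):
    """F: (N x C) flattened encoder features.
    1) importance scores alpha over the N regions,
    2) reweight each region's features by its score,
    3) project to the decoder width,
    4) adaptive average pooling down to k tokens."""
    alpha = softmax((F @ W_s + b_s).squeeze(-1), axis=0)  # (N,)
    F_att = alpha[:, None] * F                            # (N x C)
    Z = F_att @ W_p + b_p                                 # (N x d)
    # adaptive average pooling N -> k along the token axis
    idx = np.linspace(0, Z.shape[0], k + 1).astype(int)
    pooled = np.stack([Z[idx[i]:idx[i + 1]].mean(axis=0) for i in range(k)])
    return pooled, alpha

rng = np.random.default_rng(1)
N, C, d, k = 49, 16, 8, 4   # toy sizes; the real encoder uses C = 768
F = rng.normal(size=(N, C))
tokens, alpha = regional_attention(F, rng.normal(size=(C, 1)), 0.0,
                                   rng.normal(size=(C, d)), 0.0, k)
print(tokens.shape, round(alpha.sum(), 6))  # (4, 8) 1.0
```

The softmax guarantees the region weights sum to one, so the module reallocates (rather than adds) attention mass across regions before the decoder sees them.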
3. Learning Protocol and Objective
Training minimizes the standard cross-entropy loss over the predicted caption sequence:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t}, x\right)$$
The AdamW optimizer is employed with a fixed learning rate, weight decay $0.01$, and dropout $0.1$ on projection layers. The schedule consists of 5 epochs with early stopping (patience = 3) and a batch size of 8, with fixed random seeds (seed 42; three independent runs are reported).
No auxiliary losses are incorporated beyond the main objective, and model selection is based on mean ± std performance with confidence intervals.
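A minimal sketch of the teacher-forced cross-entropy objective above (NumPy, toy vocabulary; not the authors' code):

```python
import numpy as np

def caption_nll(logits, targets):
    """Teacher-forced negative log-likelihood: -sum_t log p(y_t | y_<t, x).
    logits: (T x V) decoder outputs per time step; targets: (T,) gold ids."""
    # numerically stable log-softmax over the vocabulary axis
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

# Uniform logits over a 5-token vocabulary for 3 steps: each step
# contributes -log(1/5), so the total is 3 * ln 5.
print(round(caption_nll(np.zeros((3, 5)), np.array([1, 2, 3])), 4))  # 4.8283
```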
4. Inference and Decoding Mechanism
Inference utilizes beam search with a beam size of 4, a length penalty of $1.1$, and a maximum output length of 128 tokens. This configuration balances hypothesis diversity against verbatim repetition and discourages overly terse captioning.
A plausible implication is that these decoding constraints align the outputs with clinical reporting style while preventing generic or repetitive phrasing.
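If the decoder is driven through a Hugging Face-style `generate()` interface (an assumption; the source does not name its implementation), the stated configuration maps onto the following keyword arguments:

```python
# Decoding configuration as described in the text; the keyword names
# follow the Hugging Face `generate()` convention and are assumed here.
generation_kwargs = {
    "num_beams": 4,         # beam size
    "length_penalty": 1.1,  # >1 mildly favors longer hypotheses
    "max_length": 128,      # maximum output tokens
    # The text also mentions repetition avoidance; a common knob is
    # `no_repeat_ngram_size`, left unset since its value is not given.
}
# caption_ids = model.generate(pixel_values, **generation_kwargs)  # hypothetical call
print(generation_kwargs["num_beams"], generation_kwargs["length_penalty"])
```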
5. Quantitative Evaluation Metrics
Performance is benchmarked on the ROCO test split against BLIP2-OPT and ResNet-CNN baselines. All results are mean ± std over three seeds; bracketed ranges are confidence intervals.
| Metric | BLIP2-OPT | ResNet-CNN | Proposed (Swin–BART + RA) |
|---|---|---|---|
| ROUGE | 0.255±0.006 | 0.356±0.005 | 0.603±0.004 [0.595,0.611] |
| BLEU | 0.217±0.009 | 0.311±0.007 | 0.257±0.008 [0.241,0.273] |
| CIDEr | 0.231±0.007 | 0.296±0.006 | 0.215±0.005 [0.205,0.225] |
| METEOR | 0.092±0.002 | 0.084±0.003 | 0.081±0.002 [0.077,0.085] |
| BERTScore | 0.645±0.005 | 0.623±0.006 | 0.807±0.003 [0.801,0.813] |
The Swin-BART system achieves a substantial relative gain in ROUGE (roughly 69% over the strongest baseline) and a large absolute gain in BERTScore (+0.162). BLEU, CIDEr, and METEOR are comparable to, and in some cases below, the baselines; the combination of high ROUGE and BERTScore nonetheless suggests strong semantic fidelity.
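The headline gains can be checked directly from the table:

```python
# Arithmetic check of the reported gains (values from the table above).
rouge_gain = (0.603 - 0.356) / 0.356   # relative gain over ResNet-CNN (best baseline ROUGE)
bert_gain = 0.807 - 0.645              # absolute gain over BLIP2-OPT (best baseline BERTScore)
print(round(rouge_gain, 3), round(bert_gain, 3))  # 0.694 0.162
```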
6. Ablation Studies and Analysis
6.1 Regional Attention Effect
Disabling the regional attention module reduces ROUGE from 0.603 to 0.538 and BERTScore from 0.807 to 0.762 (statistically significant; see Section 6.4), confirming its critical role in guiding clinically relevant captioning.
6.2 Token Count Sweep
Varying the number of pooled encoder tokens shows peak ROUGE (0.603) at the selected setting, with only marginal drops at adjacent values. This suggests a trade-off between representation granularity and over-smoothing.
6.3 Modality-Specific Results
| Modality | ROUGE | BERTScore |
|---|---|---|
| CT | 0.615±0.005 | 0.814±0.004 |
| MRI | 0.590±0.006 | 0.798±0.005 |
| X-ray | 0.605±0.004 | 0.804±0.003 |
Captioning performance is consistent and high across modalities, with CT yielding the strongest scores.
6.4 Statistical Significance
A paired bootstrap analysis (n = 1,000 resamples) confirms that the gains from regional attention over the no-attention variant are statistically significant across all metrics.
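A paired bootstrap test of this kind can be sketched as follows (caption-level metric scores for the two systems are assumed as inputs; this is not the authors' code):

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_boot=1000, seed=42):
    """Paired bootstrap significance test: resample caption-level score
    pairs with replacement and count how often system A fails to beat
    system B on the resampled mean. Returns the estimated one-sided
    p-value for the comparison A > B."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    wins = 0
    for _ in range(n_boot):
        sample = diffs[rng.integers(0, n, size=n)]  # resample pairs
        if sample.mean() > 0:
            wins += 1
    return 1.0 - wins / n_boot

# Toy usage: system A is uniformly better, so the p-value is tiny.
p = paired_bootstrap([0.60] * 100, [0.50] * 100)
print(p)  # 0.0
```

Pairing the scores per caption removes example-level difficulty as a confound, which is why this test is standard for comparing two captioning systems on the same test set.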
6.5 Qualitative Visualization
Heatmaps overlaying the region-wise attention scores on the original images indicate strong alignment between model focus and annotated pathological regions in clinical reports, reinforcing interpretability and facilitating human-in-the-loop validation.
7. Interpretation and Implications
The Swin-BART encoder-decoder architecture leverages hierarchical windowed vision, biomedical-aware textual decoding, and targeted regional attention to maximize both semantic accuracy and transparency in medical image captioning. The compact design and robust attribution maps suggest practical applicability for clinical report triage, decision support, and safe integration with human oversight in diagnostic pipelines.
A plausible implication is that lightweight regional attention modules, when coupled with domain-adapted token embeddings, provide an effective and interpretable route to clinical fidelity in multimodal generative modeling. Further, the model’s evaluation regime—spanning thorough metric reporting, ablation, and modality breakdown—supports its reliability and potential for deployment in research-centric medical imaging environments.