CT2RepLong: Longitudinal CT Report Generation
- The paper introduces CT2RepLong, a state-of-the-art framework that incorporates longitudinal CT data and prior reports using cross-attention and hierarchical memory.
- It employs dual-stage multi-modal fusion through transformer encoders and decoders with relational memory to capture complex inter-scan dependencies.
- The approach outperforms simple concatenation baselines with improved BLEU, METEOR, and clinical efficacy metrics, highlighting its practical impact in radiology.
CT2RepLong is an end-to-end framework for automated radiology report generation from 3D medical images, specifically non-contrast chest computed tomography (CT) scans. It is designed to explicitly incorporate longitudinal multi-modal context, namely prior imaging volumes and previous radiology reports, during report generation. It extends prior work (notably CT2Rep) with a cross-attention-based multi-modal fusion module, hierarchical memory, and domain-adapted encoder/decoder designs, with the aim of more accurately capturing how radiologists draw on historical case data when drafting follow-up reports (Hamamci et al., 2024).
1. Model Architecture and Data Flow
CT2RepLong processes a new chest CT scan and corresponding longitudinal data through a structured pipeline:
- 3D Vision Feature Extraction: The input CT volume is partitioned into non-overlapping spatio-temporal patches, each embedded into a fixed-dimensional representation, producing a tensor of patch embeddings. A deep stack of spatial and causal (auto-regressive) transformers processes these embeddings, so that features draw on the volumetric anatomical context while causal masking prevents information from later slices leaking into earlier ones (Hamamci et al., 2024).
- Transformer Encoder: The resulting patch embeddings are encoded into hidden states using multi-head attention. These hidden states serve as the key–value memory for subsequent report generation.
- Transformer Decoder with Relational Memory and Conditional LayerNorm: The decoder uses a relational memory (RM) module: at each token timestep, a memory matrix incorporates the semantic patterns accumulated over the report prefix generated so far. Updates are computed via attention, and Memory-driven Conditional LayerNorm (MCLN) provides dynamic normalization parameters as learned projections of the memory state, enabling context-sensitive decoding.
- Hierarchical Memory and Cross-Attention: The decoder attends both to the encoder output and, crucially, to fused representations derived from prior visit data (see Section 2).
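The patch partitioning in the first step can be illustrated with a minimal NumPy sketch. The function names `patchify_3d` and `embed_patches`, the toy volume and patch sizes, and the single linear projection are assumptions made for illustration; they are not the sizes or the embedding layers used by CT2RepLong.

```python
import numpy as np

def patchify_3d(volume, patch_size):
    """Split a (D, H, W) volume into non-overlapping 3D patches.

    Returns an array of shape (D//pd, H//ph, W//pw, pd*ph*pw), i.e. one
    flattened patch per grid cell. Each axis must be divisible by the
    corresponding patch size.
    """
    d, h, w = volume.shape
    pd, ph, pw = patch_size
    if d % pd or h % ph or w % pw:
        raise ValueError("volume shape must be divisible by patch_size")
    # Expose the patch grid and the within-patch offsets as separate axes.
    grid = volume.reshape(d // pd, pd, h // ph, ph, w // pw, pw)
    # Move the three grid axes to the front, then flatten each patch.
    grid = grid.transpose(0, 2, 4, 1, 3, 5)
    return grid.reshape(d // pd, h // ph, w // pw, pd * ph * pw)

def embed_patches(patches, weight, bias):
    """Linearly project each flattened patch to the embedding dimension."""
    return patches @ weight + bias

# Toy sizes chosen only so the example runs quickly (hypothetical values).
rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 16, 16))
patches = patchify_3d(volume, patch_size=(2, 4, 4))
embed_dim = 32
weight = rng.standard_normal((patches.shape[-1], embed_dim)) * 0.02
bias = np.zeros(embed_dim)
tokens = embed_patches(patches, weight, bias)
print(patches.shape, tokens.shape)
```

In this sketch the first grid axis indexes position along the slice direction, which is the axis over which a causal transformer would restrict attention to earlier positions.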
This end-to-end structure is optimized via cross-entropy loss on ground-truth radiology reports, and incorporates standard regularization (weight decay, dropout, learning rate scheduling).
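For reference, a token-level cross-entropy objective of this kind is commonly written as follows. The notation is assumed here for illustration and is not taken from the paper: $y_t$ is the $t$-th token of the ground-truth report of length $T$, $x$ is the current CT volume, $H$ is the longitudinal context from the prior visit, and $\theta$ denotes the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t},\, x,\, H\right)
```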
2. Multi-Modal Fusion and Longitudinal Cross-Attention
To model longitudinal reasoning, CT2RepLong introduces a dedicated dual-stage cross-attention pipeline:
- Longitudinal Cross-Attention: Features are extracted from the prior CT volume using the same 3D vision encoder, while the previous report is encoded by a text transformer. Cross-attention is then applied in both directions, with the prior-volume features attending to the prior-report features and vice versa. The two attended outputs are concatenated to form the fused longitudinal history representation.
- Memory Cross-Attention: At each decoder timestep, the RM state queries the fused longitudinal history through cross-attention. The attended output injects the longitudinal context into the decoder via MCLN, biasing generation toward findings that are recurrent or evolving relative to previous assessments.
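The two stages above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: attention is single-head, the learned query/key/value projections are omitted, all array sizes are toy values, and concatenating the two attended outputs along the sequence axis is an assumption about how the fusion is arranged.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries read from context.

    Learned projections are left out to keep the sketch short.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context, weights

rng = np.random.default_rng(0)
d_model = 16                                        # hypothetical width
prior_volume = rng.standard_normal((10, d_model))   # prior-CT patch features
prior_report = rng.standard_normal((7, d_model))    # prior-report token features
memory = rng.standard_normal((3, d_model))          # relational-memory slots

# Stage 1: bidirectional cross-attention between the two prior modalities.
vol_from_rep, _ = cross_attention(prior_volume, prior_report)
rep_from_vol, _ = cross_attention(prior_report, prior_volume)
# Assumption: the attended outputs are concatenated along the sequence axis.
history = np.concatenate([vol_from_rep, rep_from_vol], axis=0)

# Stage 2: the memory state queries the fused longitudinal history.
context, attn = cross_attention(memory, history)
print(history.shape, context.shape)
```

Each row of `attn` is a distribution over the fused history entries, so each memory slot reads a weighted mixture of prior-imaging and prior-report features.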
This architecture allows the model to learn fine-grained mappings between historical imaging and textual findings and the features of the current scan, outperforming simple embedding concatenation and single-modality fusion (Hamamci et al., 2024).
3. Hierarchical and Relational Memory Mechanisms
CT2RepLong employs a two-tier memory design to encode both intra-report cohesion and cross-visit dependencies:
- Relational Memory (RM): Configured as a fixed number of slots with a fixed dimensionality per slot, RM is updated at each decoder timestep, accumulating abstracted phrase-level and longitudinal patterns and thereby improving consistency and contextual recall over long outputs.
- Longitudinal Memory Cross-Attention: The external bank of fused longitudinal features serves as read-only long-term memory, capturing transferable patterns and explicitly enabling the decoder to retrieve multi-modal cues from both prior imaging and narrative data during generation.
Injecting the retrieved longitudinal context into each decoder layer’s normalization parameters keeps every layer’s representation aligned with the salient longitudinal context, a property especially relevant in follow-up clinical interpretations.
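A gated, attention-based memory update of the kind described for RM might look like the sketch below. It follows the general relational-memory pattern (attention over the previous memory plus the newest token, then forget and input gates) and should be read as an assumption-laden illustration: the parameter names, the single attention head, the residual connection, and all sizes are hypothetical and are not taken from the CT2RepLong code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relational_memory_step(memory, token, p):
    """One memory update. memory: (slots, d), token: (d,), p: weight dict."""
    # The memory attends over itself plus the newest token embedding.
    keys = np.vstack([memory, token[None, :]])            # (slots + 1, d)
    q, k, v = memory @ p["Wq"], keys @ p["Wk"], keys @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (slots, slots + 1)
    candidate = attn @ v + memory                         # residual connection
    # Gates decide how much old memory to keep and how much to write.
    forget = sigmoid(token @ p["Wf"] + np.tanh(memory) @ p["Uf"])
    write = sigmoid(token @ p["Wi"] + np.tanh(memory) @ p["Ui"])
    return forget * memory + write * np.tanh(candidate)

rng = np.random.default_rng(0)
slots, d = 3, 16                                          # hypothetical sizes
names = ("Wq", "Wk", "Wv", "Wf", "Uf", "Wi", "Ui")
params = {name: rng.standard_normal((d, d)) * 0.1 for name in names}
initial_memory = rng.standard_normal((slots, d))
prefix = rng.standard_normal((5, d))                      # embedded report prefix

memory = initial_memory.copy()
for token in prefix:                                      # one update per token
    memory = relational_memory_step(memory, token, params)
print(memory.shape)
```

The memory keeps a fixed shape regardless of how long the report prefix grows, which is what lets it act as a compact summary of the text generated so far.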
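The idea of conditioning layer normalization on a memory state can be sketched as follows. This is a minimal illustration, assuming that the scale and shift are each offset by a single linear projection of the flattened memory; the function name, the use of one linear map rather than a deeper network, and the toy sizes are assumptions, not details confirmed by the source.

```python
import numpy as np

def memory_conditioned_layer_norm(x, memory, gamma, beta, w_gamma, w_beta, eps=1e-6):
    """Layer norm whose scale and shift are offset by projections of memory.

    x: (tokens, d) activations; memory: (slots, d) memory state;
    gamma, beta: (d,) base parameters; w_gamma, w_beta: (slots * d, d).
    """
    m = memory.reshape(-1)                 # flatten the memory to one vector
    gamma_hat = gamma + m @ w_gamma        # memory-dependent scale
    beta_hat = beta + m @ w_beta           # memory-dependent shift
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma_hat * (x - mean) / (std + eps) + beta_hat

rng = np.random.default_rng(0)
slots, d = 3, 8                            # hypothetical sizes
x = rng.standard_normal((5, d))
memory = rng.standard_normal((slots, d))
gamma, beta = np.ones(d), np.zeros(d)
w_gamma = rng.standard_normal((slots * d, d)) * 0.1
w_beta = rng.standard_normal((slots * d, d)) * 0.1

conditioned = memory_conditioned_layer_norm(x, memory, gamma, beta, w_gamma, w_beta)
# With zero projections the layer reduces to ordinary layer normalization.
plain = memory_conditioned_layer_norm(
    x, memory, gamma, beta, np.zeros_like(w_gamma), np.zeros_like(w_beta)
)
print(conditioned.shape)
```

Because the offsets are recomputed from the memory at every step, the same activations are normalized differently depending on what the memory currently holds.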
4. Dataset, Preprocessing, and Training
- Data Composition: Training leverages 25,701 non-contrast chest CT volumes (21,314 patients), augmented via three reconstruction windows (yielding 49,138 volumes). For longitudinal modeling: 6,766 training and 429 validation prior volumes (at least three scans per patient), with a total of 28,441 training and 1,689 validation chronological pairs (Hamamci et al., 2024).
- Preprocessing: Images are resampled to 0.75 mm × 0.75 mm × 1.5 mm voxel spacing and center-cropped or padded to a fixed volume size, and HU values are clipped to a fixed range. Only "Findings" sections are used for report generation.
- Optimization: Training uses a single A100 GPU and the Adam optimizer, with separate learning rates for the image encoder and for the remaining modules, batch size 1, dropout, and 20 epochs with a StepLR scheduler. No auxiliary losses are used beyond cross-entropy. The model is implemented in PyTorch/HuggingFace.
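The crop-or-pad and HU clipping steps of the preprocessing can be sketched with NumPy. The target shape, the HU range, and the padding value used below are placeholder values chosen for the example, not the settings used by CT2RepLong, and the resampling step is left out (it would typically be handled by a medical imaging library such as SimpleITK).

```python
import numpy as np

def center_crop_or_pad(volume, target_shape, pad_value):
    """Center-crop axes that are too large and center-pad axes that are too small."""
    out = np.full(target_shape, pad_value, dtype=volume.dtype)
    src, dst = [], []
    for size, target in zip(volume.shape, target_shape):
        if size >= target:                      # crop this axis
            start = (size - target) // 2
            src.append(slice(start, start + target))
            dst.append(slice(0, target))
        else:                                   # pad this axis
            start = (target - size) // 2
            src.append(slice(0, size))
            dst.append(slice(start, start + size))
    out[tuple(dst)] = volume[tuple(src)]
    return out

def clip_hu(volume, hu_min, hu_max):
    """Clip Hounsfield units to a fixed intensity window."""
    return np.clip(volume, hu_min, hu_max)

# Placeholder settings for illustration only (not the paper's values).
HU_MIN, HU_MAX = -1000.0, 1000.0
TARGET = (16, 24, 24)

rng = np.random.default_rng(0)
volume = rng.integers(-2000, 2000, size=(10, 30, 20)).astype(np.float32)
prepared = center_crop_or_pad(clip_hu(volume, HU_MIN, HU_MAX), TARGET, pad_value=HU_MIN)
print(prepared.shape, prepared.min(), prepared.max())
```

Padding with the lower HU bound is a design choice in this sketch so that padded voxels look like air rather than tissue.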
5. Experimental Results and Ablation Studies
- Natural Language Generation Metrics: Compared to naïve multimodal fusion (simple concatenation of prior report and volume embeddings), CT2RepLong’s attention-based hierarchical approach achieves higher BLEU-1, BLEU-4, and METEOR scores (Hamamci et al., 2024).
- Clinical Efficacy: Evaluated with a fine-tuned CXR-BERT classifier over 18 finding labels, CT2RepLong achieves a higher clinical efficacy F1 than ablated or unimodal variants. Performance degrades when the prior report or the prior volume is used alone, which supports the need for joint attention-driven fusion.
- Qualitative Analysis: The framework consistently reuses contextually appropriate clinical descriptors (e.g., "cardiomegaly," "calcified atherosclerotic plaques") present in prior reports, attesting to the effectiveness of the memory and fusion mechanisms.
- Efficiency: Inference per volume pair takes longer than with the CT2Rep baseline; the added compute is justified given the observed performance improvements.
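A clinical efficacy score of this kind compares finding labels extracted from generated reports against labels extracted from reference reports. The sketch below computes micro-averaged precision, recall, and F1 over binary label matrices; the function name, the micro-averaging scheme, and the tiny label matrices are assumptions for illustration, and the source does not state which averaging the authors use.

```python
import numpy as np

def multilabel_f1(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for (reports x labels) matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)       # finding present and predicted
    fp = np.sum(~y_true & y_pred)      # predicted but absent
    fn = np.sum(y_true & ~y_pred)      # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)

# Two reports, three hypothetical finding labels each.
reference = [[1, 0, 1],
             [0, 1, 0]]
generated = [[1, 1, 0],
             [0, 1, 0]]
precision, recall, f1 = multilabel_f1(reference, generated)
print(precision, recall, f1)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.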
6. Design Implications, Hyperparameters, and Reproducibility
- Model size: Patch embedding and dual-transformer vision modules comprise 160M parameters; 6-layer encoder, 6-layer decoder, each with 8 attention heads.
- Memory hyperparameters: RM with a fixed number of slots and a fixed slot dimensionality; dropout and optional label smoothing are also applied.
- Framework: PyTorch with integration of HuggingFace transformers, SimpleITK for DICOM/HU processing.
- Reproducibility: All steps, from patching and embedding to model instantiation and data handling, are specified in the referenced implementation.
7. Significance and Future Directions
CT2RepLong establishes a new state of the art for 3D radiology report generation under longitudinal follow-up, demonstrating that explicit, attention-based multi-modal fusion with hierarchical memory outperforms conventional concatenation and modality-isolated baselines on both linguistic metrics and clinically relevant findings extraction (Hamamci et al., 2024). The design of the memory cross-attention and multi-modal fusion components provides a blueprint for further advances in longitudinally aware medical report generation, particularly as larger, more richly annotated multi-temporal imaging corpora become available.