CT2RepLong: Longitudinal CT Report Generation

Updated 30 March 2026
  • The paper introduces CT2RepLong, a state-of-the-art framework that incorporates longitudinal CT data and prior reports using cross-attention and hierarchical memory.
  • It employs dual-stage multi-modal fusion through transformer encoders and decoders with relational memory to capture complex inter-scan dependencies.
  • The approach outperforms simple concatenation baselines with improved BLEU, METEOR, and clinical efficacy metrics, highlighting its practical impact in radiology.

CT2RepLong is an end-to-end framework for automated radiology report generation from 3D medical images, specifically non-contrast chest computed tomography (CT) scans. It is designed to explicitly incorporate longitudinal multi-modal context, namely prior imaging volumes and previous radiology reports, during report generation. Compared with its predecessor CT2Rep, it adds a cross-attention-based multi-modal fusion module, hierarchical memory, and domain-adapted encoder/decoder designs, with the aim of modeling how radiologists draw on historical case data when drafting follow-up reports (Hamamci et al., 2024).

1. Model Architecture and Data Flow

CT2RepLong processes a new chest CT scan and corresponding longitudinal data through a structured pipeline:

  • 3D Vision Feature Extraction: The input CT volume ($x^{\text{new}} \in \mathbb{R}^{240 \times 480 \times 480}$) is partitioned into non-overlapping spatio-temporal patches ($x_n \in \mathbb{R}^{12 \times 24 \times 24}$), each embedded into a $D=512$-dimensional representation, producing a tensor $z_x$. A deep stack of spatial and causal (auto-regressive) transformers processes these embeddings, so that volumetric anatomical context is used while features from future slices cannot leak into earlier ones (Hamamci et al., 2024).
  • Transformer Encoder: The resulting patch embeddings are encoded into hidden states $(h_1, \dots, h_N)$ using multi-head attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V$$

These serve as the key–value memory for subsequent report generation.

  • Transformer Decoder with Relational Memory and Conditional LayerNorm: The decoder uses a relational memory (RM) module: at each token timestep $t$, a memory matrix $M_{t-1} \in \mathbb{R}^{m \times d}$ incorporates the semantic patterns accumulated over the report prefix $y_{<t}$. Updates are computed via attention, and Memory-driven Conditional LayerNorm (MCLN) provides dynamic normalization parameters as learned projections of $M_t$, facilitating context-sensitive decoding.
  • Hierarchical Memory and Cross-Attention: The decoder attends both to the encoder output $(h_1, \dots, h_N)$ and, crucially, to fused representations derived from prior visit data (see §2 below).
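
As a rough illustration of two ingredients in the list above, the sketch below counts the patches implied by the stated volume and patch sizes and evaluates scaled dot-product attention on plain Python lists. The helper names (`patch_grid`, `attention`) are hypothetical, and the code omits the learned projections and multiple heads that a real transformer layer would use.

```python
import math

def patch_grid(volume_shape, patch_shape):
    """Number of non-overlapping patches along each axis."""
    if any(v % p for v, p in zip(volume_shape, patch_shape)):
        raise ValueError("volume shape must be divisible by patch shape")
    return tuple(v // p for v, p in zip(volume_shape, patch_shape))

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with matrices as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Sizes taken from the text: a 240x480x480 volume and 12x24x24 patches.
grid = patch_grid((240, 480, 480), (12, 24, 24))
num_patches = grid[0] * grid[1] * grid[2]

# A zero query scores every key equally, so the output is the mean of V.
toy = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Under these sizes the grid is 20 × 20 × 20, so the vision encoder sees 8,000 patch tokens per volume before any further processing.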

This end-to-end structure is optimized via cross-entropy loss on ground-truth radiology reports, and incorporates standard regularization (weight decay, dropout, learning rate scheduling).
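
The cross-entropy objective can be illustrated with a minimal sketch, assuming teacher forcing with one row of vocabulary logits per target token. The function name `token_cross_entropy` and the toy inputs are illustrative and not taken from the referenced implementation.

```python
import math

def token_cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        peak = max(row)
        # log-sum-exp gives the log of the softmax normalizer in a stable way
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
    return total / len(target_ids)

# With uniform logits over a 4-token vocabulary the loss equals ln(4).
uniform_loss = token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
```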

2. Multi-Modal Fusion and Longitudinal Cross-Attention

To model longitudinal reasoning, CT2RepLong introduces a dedicated dual-stage cross-attention pipeline:

  • Longitudinal Cross-Attention: Features are extracted from the prior CT volume using the same 3D vision encoder, while the previous report is encoded via a text transformer. Cross-attention is then performed bidirectionally with the scaled dot-product attention of §1: the prior-volume features act as queries over the prior-report features (keys and values), and the prior-report features act as queries over the prior-volume features.

The two attended outputs are concatenated to form the fused longitudinal history vector.

  • Memory Cross-Attention: At each decoder timestep, the RM state (the memory matrix of §1) queries the fused longitudinal history, which supplies both the keys and the values of the scaled dot-product attention.

The attended output injects the longitudinal context into the decoder via MCLN, biasing generation toward findings that are recurrent or evolving when compared with previous assessments.
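
The two stages can be sketched as follows. This is a simplified illustration under stated assumptions: no learned projections, a single head, and concatenation along the token axis (the source does not say which axis is used). The names `fuse_history` and `memory_cross_attention` are hypothetical.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

def fuse_history(vol_feats, rep_feats):
    """Stage 1: bidirectional cross-attention between prior volume and prior report."""
    vol_ctx = attention(vol_feats, rep_feats, rep_feats)  # volume queries report
    rep_ctx = attention(rep_feats, vol_feats, vol_feats)  # report queries volume
    return vol_ctx + rep_ctx  # assumed: concatenation along the token axis

def memory_cross_attention(memory, history):
    """Stage 2: memory slots query the fused history (keys and values)."""
    return attention(memory, history, history)

# Toy inputs: 3 prior-volume tokens, 2 prior-report tokens, 3 memory slots, width 4.
vol_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
rep_feats = [[0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.0, 1.0]]
memory = [[0.0] * 4 for _ in range(3)]

history = fuse_history(vol_feats, rep_feats)
context = memory_cross_attention(memory, history)
```

Here `history` holds five tokens and `context` has one row per memory slot, which is the shape needed to condition the decoder's normalization layers.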

This architecture allows the model to learn fine-grained mappings between historical imaging/textual findings and current scan features, outperforming simple embedding concatenation or single-modal fusion (Hamamci et al., 2024).

3. Hierarchical and Relational Memory Mechanisms

CT2RepLong employs a two-tier memory design to encode both intra-report cohesion and cross-visit dependencies:

  • Relational Memory (RM): Configured as $m$ slots with $d$ dimensions per slot, RM updates at each decoder timestep, accumulating abstracted phrase-level and longitudinal patterns, thus enhancing consistency and contextual recall over long outputs.
  • Longitudinal Memory Cross-Attention: The fused longitudinal history serves as an external, read-only long-term memory bank, capturing transferable patterns and explicitly enabling the decoder to retrieve multi-modal cues from both prior imaging and narrative data during generation.
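
A relational memory step might look like the sketch below, which is an assumption-laden simplification: the memory slots attend over themselves plus the latest token embedding and the result is added back residually. Learned projections and gating, which a trained model would normally include, are left out, and `update_memory` is a hypothetical name.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

def update_memory(memory, token_embedding):
    """One simplified RM step: slots attend over [memory; token], then add residually."""
    context = memory + [token_embedding]          # m + 1 rows of width d
    attended = attention(memory, context, context)
    return [[old + new for old, new in zip(m_row, a_row)]
            for m_row, a_row in zip(memory, attended)]

# Toy run with m = 3 slots and d = 4: an all-zero memory reads one token.
memory = [[0.0] * 4 for _ in range(3)]
memory = update_memory(memory, [1.0, 1.0, 1.0, 1.0])
```

With an all-zero memory every key receives equal weight, so each slot ends up holding a quarter of the token embedding after the first step.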

Injecting the memory cross-attention output into each decoder layer’s normalization parameters keeps every layer’s representation aligned with salient longitudinal context, a property especially relevant in follow-up clinical interpretations.
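
To make the conditioning concrete, here is a minimal sketch of a memory-conditioned layer normalization, assuming the scale and shift offsets are single linear projections of a flattened memory vector. The function `mcln` and the matrices `W_gamma` and `W_beta` are illustrative names, not the referenced implementation's API.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def mcln(x, memory_vec, gamma, beta, W_gamma, W_beta, eps=1e-5):
    """Layer norm whose scale and shift are offset by projections of the memory."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    d_gamma = matvec(W_gamma, memory_vec)   # memory-driven change to the scale
    d_beta = matvec(W_beta, memory_vec)     # memory-driven change to the shift
    return [(g + dg) * n + (b + db)
            for n, g, dg, b, db in zip(normed, gamma, d_gamma, beta, d_beta)]

x = [1.0, 2.0, 3.0]
memory_vec = [1.0, 1.0, 1.0]
gamma, beta = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

plain = mcln(x, memory_vec, gamma, beta, zeros, zeros)       # ordinary layer norm
shifted = mcln(x, memory_vec, gamma, beta, zeros, identity)  # memory shifts the output
```

With zero projection matrices the function reduces to ordinary layer normalization, which makes the memory's contribution easy to isolate when debugging.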

4. Dataset, Preprocessing, and Training

  • Data Composition: Training leverages 25,701 non-contrast chest CT volumes (21,314 patients), augmented via three reconstruction windows (yielding 49,138 volumes). For longitudinal modeling: 6,766 training and 429 validation prior volumes (at least three scans per patient), with a total of 28,441 training and 1,689 validation chronological pairs (Hamamci et al., 2024).
  • Preprocessing: Images are resampled to 0.75 mm × 0.75 mm × 1.5 mm voxel spacing and center-cropped or padded to the fixed input size ($240 \times 480 \times 480$, as in §1); HU values are clipped to a fixed range. Only the "Findings" sections are used for report generation.
  • Optimization: Training uses a single A100 GPU and the Adam optimizer, with one learning rate for the image encoder and another for the remaining parameters, batch size 1, dropout, 20 epochs, and a StepLR scheduler. No auxiliary losses are used beyond cross-entropy. The model is implemented in PyTorch/HuggingFace.
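
The crop/pad and clipping steps can be sketched on nested lists as below. The HU bounds and the padding value are placeholders chosen for illustration only, since the actual range is not given here, and the helper names are hypothetical. The same recursion works for a 3D volume; a 2D toy array keeps the example readable.

```python
def full(shape, fill):
    """Nested list of the given shape filled with one value."""
    if not shape:
        return fill
    return [full(shape[1:], fill) for _ in range(shape[0])]

def center_crop_or_pad(array, target_shape, fill):
    """Recursively center-crop or pad a nested list to target_shape."""
    size, rest = target_shape[0], target_shape[1:]
    if rest:
        array = [center_crop_or_pad(a, rest, fill) for a in array]
    n = len(array)
    if n >= size:
        start = (n - size) // 2
        return array[start:start + size]
    before = (size - n) // 2
    after = size - n - before
    return ([full(rest, fill) for _ in range(before)] + array
            + [full(rest, fill) for _ in range(after)])

def clip_hu(array, low, high):
    """Clamp every value of a nested list into [low, high]."""
    if isinstance(array, list):
        return [clip_hu(a, low, high) for a in array]
    return max(low, min(high, array))

# Placeholder bounds for illustration only; not the values used in the paper.
HU_MIN, HU_MAX = -1000, 1000

toy = [[-2000, 50, 3000]]                       # shape (1, 3)
resized = center_crop_or_pad(toy, (3, 2), HU_MIN)
processed = clip_hu(resized, HU_MIN, HU_MAX)
```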

5. Experimental Results and Ablation Studies

  • Natural Language Generation Metrics: Compared to naïve multimodal fusion (simple concatenation of prior report and volume embeddings), CT2RepLong’s attention-based hierarchical approach achieves higher BLEU-1, BLEU-4, and METEOR scores (Hamamci et al., 2024).
  • Clinical Efficacy: Evaluated with a fine-tuned CXR-BERT classifier over 18 finding labels, CT2RepLong achieves a higher clinical efficacy F1 than ablated or unimodal variants; performance degrades when the prior report or the prior volume is used alone, which supports the case for joint attention-driven fusion.
  • Qualitative Analysis: The framework consistently reuses contextually appropriate clinical descriptors (e.g., "cardiomegaly," "calcified atherosclerotic plaques") present in prior reports, attesting to the effectiveness of the memory and fusion mechanisms.
  • Efficiency: Inference on a volume pair takes longer than with the CT2Rep baseline; the added compute is presented as justified by the observed performance improvements.
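
For readers unfamiliar with the metrics named above, the following sketch computes a clipped n-gram precision (the core quantity of BLEU-n, without the brevity penalty or multi-order averaging) and a micro-averaged F1 over binary finding labels. Whether the evaluation uses micro or macro averaging is not stated here, so micro averaging is an assumption, and the example sentences and labels are invented.

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n):
    """Share of candidate n-grams also found in the reference, with clipped counts."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def micro_f1(pred_labels, true_labels):
    """Micro-averaged F1 over per-report binary label vectors."""
    tp = fp = fn = 0
    for pred, true in zip(pred_labels, true_labels):
        for p, t in zip(pred, true):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

generated = "no pleural effusion is seen".split()
reference = "no pleural effusion is observed".split()
p1 = clipped_ngram_precision(generated, reference, 1)
p2 = clipped_ngram_precision(generated, reference, 2)

# Two reports, three findings each (the paper's setting uses 18 labels).
f1 = micro_f1([[1, 0, 1], [0, 0, 1]], [[1, 1, 0], [0, 0, 1]])
```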

6. Design Implications, Hyperparameters, and Reproducibility

  • Model size: The patch embedding and dual-transformer vision modules comprise on the order of 60M parameters; the encoder and decoder each have 6 layers with 8 attention heads.
  • Memory hyperparameters: RM with $m$ slots of dimension $d$ (the $m \times d$ memory matrix of §1), together with dropout and optional label smoothing.
  • Framework: PyTorch with integration of HuggingFace transformers, SimpleITK for DICOM/HU processing.
  • Reproducibility: All steps, including patching, embedding, model instantiation, and data handling, are fully specified in the referenced implementation.
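
One way to keep the hyperparameters stated in this article in a single place is a small configuration object, sketched below. The class and field names are hypothetical. Values not given in the text are deliberately left out, and the per-head width assumes the encoder and decoder share the 512-dimensional embedding size, which the text does not confirm.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CT2RepLongConfig:
    """Illustrative container for hyperparameters stated in the article."""
    volume_shape: tuple = (240, 480, 480)
    patch_shape: tuple = (12, 24, 24)
    embed_dim: int = 512
    encoder_layers: int = 6
    decoder_layers: int = 6
    attention_heads: int = 8
    epochs: int = 20
    batch_size: int = 1

    def head_dim(self):
        """Per-head width, assuming embed_dim is split evenly across heads."""
        if self.embed_dim % self.attention_heads:
            raise ValueError("embed_dim must be divisible by attention_heads")
        return self.embed_dim // self.attention_heads

cfg = CT2RepLongConfig()
per_head = cfg.head_dim()
```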

7. Significance and Future Directions

CT2RepLong is reported to set a new state of the art for 3D radiology report generation under longitudinal follow-up. Its results indicate that explicit, attention-based multi-modal fusion with hierarchical memory outperforms conventional concatenation and modality-isolated baselines, both on linguistic metrics and on clinically relevant findings extraction (Hamamci et al., 2024). The combination of memory cross-attention and multi-modal fusion provides a blueprint for further advances in longitudinally aware medical report generation, particularly as larger, more richly annotated multi-temporal imaging corpora become available.

References (1)