CT2RepLong: Longitudinal CT Report Generation

Updated 30 March 2026

The paper introduces CT2RepLong, a state-of-the-art framework that incorporates longitudinal CT data and prior reports using cross-attention and hierarchical memory.
It employs dual-stage multi-modal fusion through transformer encoders and decoders with relational memory to capture complex inter-scan dependencies.
The approach outperforms simple concatenation baselines with improved BLEU, METEOR, and clinical efficacy metrics, highlighting its practical impact in radiology.

CT2RepLong is an end-to-end framework for automated radiology report generation from 3D medical images, specifically non-contrast chest computed tomography (CT) scans. It is designed to explicitly incorporate longitudinal multi-modal context (prior imaging volumes and previous radiology reports) during report generation. It extends CT2Rep with a cross-attention-based multi-modal fusion module, hierarchical memory, and domain-adapted encoder/decoder designs, with the aim of capturing how radiologists draw on historical case data when drafting follow-up reports (Hamamci et al., 2024).

1. Model Architecture and Data Flow

CT2RepLong processes a new chest CT scan and corresponding longitudinal data through a structured pipeline:

3D Vision Feature Extraction: The input CT volume ($x^{\text{new}} \in \mathbb{R}^{240 \times 480 \times 480}$) is partitioned into non-overlapping spatio-temporal patches ($x_n \in \mathbb{R}^{12 \times 24 \times 24}$), each embedded into a $D=512$-dimensional representation, producing a tensor $z_x$. A stack of spatial and causal (auto-regressive) transformers processes these embeddings, so that features draw on the surrounding volumetric anatomical context while causal masking prevents leakage from later slices (Hamamci et al., 2024).
Transformer Encoder: The resulting patch embeddings are encoded into hidden states $(h_1, \dots, h_N)$ using multi-head attention:

$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V$

These serve as the key–value memory for subsequent report generation.
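The attention formula above can be illustrated with a minimal single-head sketch in plain Python. This is not the paper's implementation: the real encoder is multi-head with learned query, key, and value projections, which are omitted here, and the toy matrices are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for row-major lists of lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy input (hypothetical): 2 queries, 3 key/value pairs, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = attention(Q, K, V)
```

Because the softmax weights sum to one, every output row is a convex combination of the value rows. This is why the encoder states can act as a key-value memory that the decoder reads from.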

Transformer Decoder with Relational Memory and Conditional LayerNorm: The decoder uses a relational memory (RM) module. At each token timestep $t$, a memory matrix $M_{t-1} \in \mathbb{R}^{m \times d}$ holds the semantic patterns accumulated over the report prefix $y_{<t}$ and is updated via attention. Memory-driven Conditional LayerNorm (MCLN) then computes its normalization parameters as learned projections of $M_t$, so that decoding is conditioned on what has been generated so far.
Hierarchical Memory and Cross-Attention: The decoder attends both to the encoder output $(h_1, \dots, h_N)$ and, crucially, to fused representations derived from prior visit data (see §2 below).
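The idea of memory-driven normalization can be sketched as follows. The text only states that the normalization parameters are learned projections of the memory. The additive form used here (base scale and shift plus memory-predicted offsets), the single linear projection, and all weights and dimensions are assumptions chosen for illustration.

```python
import math

def linear(vec, weight):
    """Apply a weight matrix (list of rows, shape out x in) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def mcln(x, memory, gamma, beta, w_gamma, w_beta, eps=1e-5):
    """Layer-normalize x with scale/shift offsets predicted from the memory."""
    flat = [v for slot in memory for v in slot]   # flatten the m x d memory
    d_gamma = linear(flat, w_gamma)               # memory-driven scale offset
    d_beta = linear(flat, w_beta)                 # memory-driven shift offset
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [(g + dg) * (v - mean) / std + (b + db)
            for v, g, dg, b, db in zip(x, gamma, d_gamma, beta, d_beta)]

# Hypothetical toy values: a 4-dim hidden state and a 2 x 2 memory.
hidden = [1.0, 2.0, 3.0, 4.0]
memory = [[0.5, -0.5], [1.0, 0.0]]
gamma, beta = [1.0] * 4, [0.0] * 4
zero = [[0.0] * 4 for _ in range(4)]
w_beta = [[0.0, 0.0, 0.5, 0.0] for _ in range(4)]   # yields a shift of 0.5

plain = mcln(hidden, memory, gamma, beta, zero, zero)      # ordinary LayerNorm
shifted = mcln(hidden, memory, gamma, beta, zero, w_beta)  # memory moves the shift
```

With zero projection weights the function reduces to ordinary layer normalization. Non-zero weights let the memory content move the scale and shift, which is the mechanism by which context reaches each decoder layer.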

This end-to-end structure is optimized via cross-entropy loss on ground-truth radiology reports, with weight decay, dropout, and learning rate scheduling applied during training.
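As a quick check on the shapes in the feature-extraction step, the patch grid implied by the stated volume and patch sizes can be computed directly. The helper name `patch_grid` is hypothetical, and the flattened `(num_patches, embed_dim)` layout is an assumption about how the embeddings might be arranged.

```python
from math import prod

volume_shape = (240, 480, 480)   # input volume size from the text
patch_shape = (12, 24, 24)       # patch size from the text
embed_dim = 512                  # embedding dimension D from the text

def patch_grid(volume, patch):
    """Number of non-overlapping patches along each axis."""
    if any(v % p for v, p in zip(volume, patch)):
        raise ValueError("volume must be divisible by the patch size")
    return tuple(v // p for v, p in zip(volume, patch))

grid = patch_grid(volume_shape, patch_shape)   # (20, 20, 20)
num_patches = prod(grid)                       # 8000
voxels_per_patch = prod(patch_shape)           # 6912
embedded_shape = (num_patches, embed_dim)      # assumed flattened layout
```

Each volume therefore yields 8,000 patch tokens of 6,912 voxels each before embedding, which gives a sense of the sequence length the spatial and causal transformers operate on.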

2. Multi-Modal Fusion and Longitudinal Cross-Attention

To model longitudinal reasoning, CT2RepLong introduces a dedicated dual-stage cross-attention pipeline:

Longitudinal Cross-Attention: Features are extracted from the prior CT volume ($x^{\text{prev}}$) using the same 3D vision encoder, yielding $z_x^{\text{prev}}$, while the previous report ($r^{\text{prev}}$) is encoded into $z_r^{\text{prev}}$ via a text transformer. Cross-attention is performed bi-directionally, with each modality querying the other:

$z_{x \to r} = \text{Attention}(z_x^{\text{prev}}, z_r^{\text{prev}}, z_r^{\text{prev}}), \qquad z_{r \to x} = \text{Attention}(z_r^{\text{prev}}, z_x^{\text{prev}}, z_x^{\text{prev}})$

These are concatenated to form the fused longitudinal history vector $H = [z_{x \to r}; z_{r \to x}]$.

Memory Cross-Attention: At each decoder timestep, the RM state $M_t$ queries $H$:

$C_t = \text{Attention}(M_t, H, H)$

$C_t$ injects the longitudinal context into the decoder via MCLN, biasing generation toward findings that are recurrent or evolving when compared with previous assessments.

This architecture allows the model to learn fine-grained mappings between historical imaging/textual findings and current scan features, outperforming simple embedding concatenation or single-modal fusion (Hamamci et al., 2024).
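The two stages described in this section can be sketched with toy features in plain Python. This is a simplified illustration under several assumptions: learned projections and multiple heads are omitted, the two cross-attended outputs are concatenated along the token axis, and all variable names and values are hypothetical.

```python
import math

def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical toy features: rows are tokens, columns are feature dims.
prior_volume_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 volume tokens
prior_report_feats = [[0.5, 0.5], [1.0, -1.0]]               # 2 report tokens
memory_state = [[0.2, 0.1], [0.0, 0.3]]                      # 2 memory slots

# Stage 1: bi-directional cross-attention between the prior modalities.
volume_to_report = attention(prior_volume_feats, prior_report_feats,
                             prior_report_feats)
report_to_volume = attention(prior_report_feats, prior_volume_feats,
                             prior_volume_feats)
history = volume_to_report + report_to_volume   # concatenate along tokens

# Stage 2: the relational-memory slots query the fused history.
longitudinal_context = attention(memory_state, history, history)
```

The result has one context row per memory slot, so it can be fed to the same conditional normalization path that the relational memory already uses. Concatenating along the feature axis instead would require the two modalities to have the same number of tokens.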

3. Hierarchical and Relational Memory Mechanisms

CT2RepLong employs a two-tier memory design to encode both intra-report cohesion and cross-visit dependencies:

Relational Memory (RM): Configured as $m$ slots with $d$ dimensions per slot, RM updates at each decoder timestep, accumulating abstracted phrase-level and longitudinal patterns, which supports consistency and contextual recall over long outputs.
Longitudinal Memory Cross-Attention: The fused longitudinal history from §2 serves as read-only long-term memory, allowing the decoder to retrieve multi-modal cues from both prior imaging and narrative data during generation.

Injecting the resulting longitudinal context into each decoder layer's normalization parameters keeps every layer's representation aligned with that context, a property especially relevant in follow-up clinical interpretations.

4. Dataset, Preprocessing, and Training

Data Composition: Training uses 25,701 non-contrast chest CT volumes (21,314 patients), augmented via three reconstruction windows (yielding 49,138 volumes). For longitudinal modeling, there are 6,766 training and 429 validation prior volumes (at least three scans per patient), giving a total of 28,441 training and 1,689 validation chronological pairs (Hamamci et al., 2024).
Preprocessing: Images are resampled to 0.75 mm × 0.75 mm × 1.5 mm voxel spacing and center-cropped or padded to $240 \times 480 \times 480$, the input size given in §1, with HU values clipped to a fixed range. Only "Findings" sections are used for report generation.
Optimization: Training runs on a single A100 GPU with the Adam optimizer, one learning rate for the image encoder and another for the remaining modules, batch size 1, dropout, and 20 epochs with a StepLR scheduler. No auxiliary losses are used beyond cross-entropy. The model is implemented in PyTorch/HuggingFace.
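The preprocessing steps listed above can be sketched per axis in plain Python. The function names are hypothetical, and the HU bounds in the example call are placeholders rather than the values used in the paper. A real pipeline would apply the same logic to 3D arrays, for instance with SimpleITK.

```python
def resampled_size(n_voxels, spacing_mm, target_spacing_mm):
    """Voxel count along one axis after resampling to a new spacing."""
    return round(n_voxels * spacing_mm / target_spacing_mm)

def clip_hu(values, lo, hi):
    """Clamp Hounsfield-unit intensities to the range [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]

def center_crop_or_pad(seq, target, pad_value):
    """Center-crop or pad a 1D sequence to exactly `target` elements."""
    n = len(seq)
    if n >= target:
        start = (n - target) // 2
        return list(seq[start:start + target])
    before = (target - n) // 2
    after = target - n - before
    return [pad_value] * before + list(seq) + [pad_value] * after

# In-plane: 512 voxels at 0.7 mm resampled to 0.75 mm spacing.
in_plane = resampled_size(512, 0.7, 0.75)
# Through-plane: 300 slices at 1.0 mm resampled to 1.5 mm spacing.
depth = resampled_size(300, 1.0, 1.5)

# Placeholder HU bounds, not the paper's values.
clipped = clip_hu([-2000, -500, 0, 1500], -1000, 1000)

cropped = center_crop_or_pad([1, 2, 3, 4, 5, 6], 4, 0)
padded = center_crop_or_pad([1, 2, 3], 6, 0)
```

Applying `center_crop_or_pad` along each of the three axes in turn brings a resampled volume to the fixed input size, whether the scan is larger or smaller than the target along that axis.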

5. Experimental Results and Ablation Studies

Natural Language Generation Metrics: Compared to naïve multimodal fusion (simple concatenation of prior report and volume embeddings), CT2RepLong's combination of cross-attention and hierarchical memory achieves higher BLEU-1, BLEU-4, and METEOR scores (Hamamci et al., 2024).
Clinical Efficacy: Evaluated with a fine-tuned CXR-BERT over 18 finding labels, CT2RepLong achieves a higher clinical efficacy F1 than ablated or unimodal variants. Performance degrades when the prior report or the prior volume is used alone, which supports the case for joint attention-driven fusion.
Qualitative Analysis: The framework consistently reuses contextually appropriate clinical descriptors (e.g., "cardiomegaly," "calcified atherosclerotic plaques") present in prior reports, attesting to the effectiveness of the memory and fusion mechanisms.
Efficiency: Inference on a volume pair costs more than the CT2Rep baseline; the added compute is justified given the observed performance improvements.
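A clinical efficacy score of the kind described above compares finding labels extracted from generated reports with labels from the ground-truth reports. The sketch below assumes micro-averaging over all labels, which the text does not specify, and uses three labels instead of 18 to keep the toy data readable.

```python
def micro_f1(predicted, reference):
    """Micro-averaged F1 over binary multi-label rows (one row per report)."""
    tp = fp = fn = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1          # finding present in both reports
            elif p and not r:
                fp += 1          # finding only in the generated report
            elif r and not p:
                fn += 1          # finding missed by the generated report
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 2 reports x 3 findings.
reference = [[1, 0, 1], [0, 1, 0]]   # from ground-truth reports
predicted = [[1, 0, 0], [0, 1, 1]]   # from generated reports
score = micro_f1(predicted, reference)
```

Here two findings are matched, one is missed, and one is spurious, so precision and recall are both 2/3. In practice the label rows would come from the text classifier applied to each report.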

6. Design Implications, Hyperparameters, and Reproducibility

Model size: Patch embedding and dual-transformer vision modules comprise approximately 60M parameters; the encoder and decoder each have 6 layers with 8 attention heads.
Memory hyperparameters: RM with $m$ slots of dimension $d$, dropout, and optional label smoothing.
Framework: PyTorch with integration of HuggingFace transformers, SimpleITK for DICOM/HU processing.
Reproducibility: All steps, including patching, embedding, model instantiation, and data handling, are specified in the referenced implementation.

7. Significance and Future Directions

CT2RepLong is reported to set a new state of the art for 3D radiology report generation under longitudinal follow-up. The results indicate that explicit, attention-based multi-modal fusion with hierarchical memory outperforms conventional concatenation and modality-isolated baselines, both on linguistic metrics and on clinically relevant findings extraction (Hamamci et al., 2024). The design of the memory cross-attention and multi-modal fusion modules offers a blueprint for further advances in longitudinally aware medical report generation, particularly as larger, more richly annotated multi-temporal imaging corpora become available.


References

Hamamci et al. (2024). CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging.
