- The paper presents a promptable segmentation approach that extends the LongiSeg framework using a ResEncL U-Net to enhance lesion tracking.
- It integrates point and mask prompts with longitudinal scan alignment, employing Gaussian blobs and binary masks for robust input representation.
- The method leverages synthetic pretraining to capture temporal lesion evolution, achieving up to a 6-point Dice improvement and reducing segmentation errors.
Promptable Longitudinal Lesion Segmentation in Whole-Body CT: Technical Summary and Implications
Introduction and Motivation
The paper addresses the challenge of robust lesion segmentation and tracking in longitudinal whole-body CT, a critical task for monitoring oncological disease progression and treatment response. While cross-sectional segmentation methods have matured, consistent tracking of individual lesions across timepoints remains nontrivial due to anatomical variability, patient positioning, and limited annotated datasets. The autoPET/CT IV Challenge Task 2 reframes this as a promptable segmentation problem, providing both baseline and follow-up lesion localizations and masks, thus enabling the exploration of interactive, temporally aware segmentation frameworks.
Methodological Framework
The proposed approach extends the LongiSeg framework to support promptable segmentation via point and mask interactions. The backbone is a deep U-Net variant (ResEncL), characterized by multiple residual blocks per encoder layer, facilitating robust feature extraction from volumetric CT data. Longitudinal context is incorporated by aligning baseline and follow-up scans using provided lesion center coordinates and concatenating them along the channel dimension. Prompt representation leverages Gaussian blobs for point prompts, normalized to unit intensity, and binary masks for mask prompts, both supplied as additional input channels.
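The prompt encoding and channel-wise stacking described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `sigma` value, and the exact channel ordering are assumptions for clarity.

```python
import numpy as np

def gaussian_point_prompt(shape, center, sigma=3.0):
    """Encode a point prompt as a 3D Gaussian blob, normalized to unit peak intensity."""
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    sq_dist = sum((g - c) ** 2 for g, c in zip(grids, center))
    blob = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return blob / blob.max()  # unit intensity, per the normalized variant in the paper

def build_model_input(baseline_ct, followup_ct, prompt_channel):
    """Stack aligned baseline scan, follow-up scan, and prompt map along the channel axis."""
    return np.stack([baseline_ct, followup_ct, prompt_channel], axis=0)
```

A binary mask prompt would simply replace the Gaussian blob in the same prompt channel, keeping the network's input signature unchanged.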
Training employs lesion-centric patch extraction, with random spatial shifts to enhance robustness. Pretraining is performed on a large synthetic longitudinal CT dataset generated via anatomy-informed augmentation, addressing the limited size of the challenge dataset and enabling the model to learn temporal lesion evolution patterns.
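Lesion-centric patch extraction with random spatial shifts, as used during training, might look like the following sketch. Patch size, shift range, and the border-padding strategy are illustrative assumptions, not values from the paper.

```python
import numpy as np

def extract_lesion_patch(volume, center, patch_size=64, max_shift=8, rng=None):
    """Crop a cubic patch around a lesion center, jittered by a random spatial shift.

    Padding with the volume minimum keeps crops valid for lesions near the border.
    """
    if rng is None:
        rng = np.random.default_rng()
    shift = rng.integers(-max_shift, max_shift + 1, size=3)
    half = patch_size // 2
    pad = patch_size  # generous padding so any shifted crop stays in bounds
    padded = np.pad(volume, pad, mode="minimum")
    start = [c + s - half + pad for c, s in zip(center, shift)]
    return padded[tuple(slice(st, st + patch_size) for st in start)]
```

At inference, setting `max_shift=0` recovers the perfectly centered crop, matching the alignment assumption discussed below.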
Results and Quantitative Analysis
Ablation studies and five-fold cross-validation demonstrate that longitudinal pretraining on synthetic data yields substantial improvements. The final model achieves a Dice score of 63.71, outperforming models trained from scratch or initialized with weights from related frameworks (nnInteractive, LesionLocator). Notably, pretraining enables the model to exploit longitudinal context, with a Dice improvement of up to 6 points over baseline approaches. False-negative and false-positive volumes are also reduced, indicating improved lesion detection and less oversegmentation.
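The reported metrics can be computed as below. This is a generic sketch of the Dice coefficient and false-negative/false-positive volumes; the voxel-volume parameter and empty-mask convention are assumptions, since the challenge's exact evaluation code is not reproduced here.

```python
import numpy as np

def dice_score(pred, gt):
    """Dice coefficient between two binary masks (defined as 1.0 when both are empty)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def error_volumes(pred, gt, voxel_volume_ml=1.0):
    """False-negative and false-positive volumes, scaled by the per-voxel volume."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    fn = np.logical_and(gt, ~pred).sum() * voxel_volume_ml
    fp = np.logical_and(pred, ~gt).sum() * voxel_volume_ml
    return fn, fp
```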
Figure 1: Case 1 – Ground Truth segmentation for a representative patient, illustrating the complexity of lesion boundaries and spatial distribution.
Qualitative results reveal accurate lesion tracking and segmentation, with most errors localized to boundary regions. The ensemble of five cross-validation folds further stabilizes performance for test set submission.
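The five-fold ensemble used for the test submission can be sketched as simple probability averaging. The paper does not detail its fusion rule, so mean-then-threshold is an assumption here, as is the threshold value.

```python
import numpy as np

def ensemble_predict(fold_probability_maps, threshold=0.5):
    """Average per-fold foreground probability maps and threshold the mean."""
    mean_prob = np.mean(fold_probability_maps, axis=0)
    return (mean_prob >= threshold).astype(np.uint8)
```

Averaging soft probabilities before thresholding tends to smooth out fold-specific errors at lesion boundaries, which is consistent with the stabilizing effect described above.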
Architectural and Implementation Considerations
- Backbone Selection: The ResEncL U-Net backbone is well-suited for volumetric medical imaging, offering deep hierarchical feature extraction and residual learning for improved gradient flow.
- Prompt Encoding: Gaussian blobs for point prompts, normalized to unit intensity, are empirically superior to unnormalized representations, likely due to harmonized input scaling.
- Longitudinal Input Alignment: Precise alignment via lesion centers is essential; random shifts during training improve generalization, while perfect alignment during inference maximizes accuracy.
- Pretraining Strategy: Large-scale synthetic longitudinal data is critical for learning temporal lesion dynamics, especially when annotated real-world datasets are limited.
- Batch Size: Smaller batch sizes (e.g., 2) perform better in this setting, plausibly because the noisier gradient estimates act as an implicit regularizer when training data are limited.
Practical and Theoretical Implications
The integration of promptable segmentation with longitudinal context advances the state of lesion tracking in whole-body CT. The demonstrated gains from synthetic pretraining highlight the importance of data augmentation and simulation in medical imaging, especially for rare or complex tasks. The framework is readily extensible to other modalities and diseases where temporal lesion evolution is clinically relevant.
From a deployment perspective, the model's reliance on lesion center coordinates and prompt masks aligns with clinical workflows, where radiologists can provide minimal interaction to guide segmentation. The single-pass inference per lesion and support for multilabel outputs facilitate integration into PACS systems and downstream analytics.
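Assembling single-pass, per-lesion predictions into one multilabel output could work as in this sketch. The function name and the overwrite-on-overlap rule are assumptions for illustration; the paper does not specify how overlapping lesion predictions are resolved.

```python
import numpy as np

def multilabel_from_lesion_masks(shape, lesion_masks):
    """Merge per-lesion binary predictions into a single integer-labeled map.

    Labels are assigned in prompt order; on overlap, later lesions
    overwrite earlier ones.
    """
    label_map = np.zeros(shape, dtype=np.uint16)
    for label, mask in enumerate(lesion_masks, start=1):
        label_map[mask.astype(bool)] = label
    return label_map
```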
Future Directions
Potential avenues for further research include:
- Generalization to Unseen Lesion Types: Extending the synthetic pretraining pipeline to encompass a broader spectrum of lesion morphologies and anatomical sites.
- Uncertainty Quantification: Incorporating Bayesian or ensemble methods to estimate segmentation confidence, particularly in ambiguous regions.
- Active Learning: Leveraging model uncertainty to prioritize annotation of challenging cases, thereby improving data efficiency.
- Multi-modal Integration: Fusing PET and CT data for enhanced lesion characterization and tracking.
Conclusion
The paper presents a technically rigorous extension of the LongiSeg framework, combining promptable segmentation and longitudinal context for accurate lesion tracking in whole-body CT. The use of large-scale synthetic pretraining is shown to be essential for robust performance, with strong quantitative and qualitative results. The approach is well-positioned for clinical translation and further methodological innovation in longitudinal medical image analysis.