SurLonFormer: Dynamic Survival Prediction
- SurLonFormer is a Transformer-based architecture that integrates longitudinal medical imaging and clinical covariates for dynamic survival prediction.
- It employs a vision encoder, autoregressive sequence encoder, and a survival encoder inspired by Cox proportional hazards to address censoring, temporal correlations, and interpretability issues.
- Empirical evaluations in simulations and Alzheimer’s disease applications demonstrate its superior predictive performance and effective spatial biomarker identification.
SurLonFormer is a Transformer-based neural architecture developed for dynamic survival prediction using longitudinal medical imaging alongside structured clinical covariates. The architecture addresses key limitations in existing survival models, notably the suboptimal exploitation of censored data, neglect of temporal correlations among serial images, and limited interpretability. SurLonFormer integrates a vision transformer for patch-wise image feature extraction, an autoregressive transformer for longitudinal sequence modeling, and a Cox proportional hazards-inspired neural network for risk estimation, thereby enabling flexible and interpretable dynamic predictions in high-dimensional and temporally evolving medical datasets.
1. Architectural Components
SurLonFormer is composed of three principal modules designed for hierarchical representation and risk modeling:
- Vision Encoder: Processes individual medical images (e.g., MRI scans) by partitioning each into equal-sized patches, flattening, and projecting each patch into a $d$-dimensional embedding space. A learnable CLS token ($\text{CLS}_v$) is appended, and positional encodings are incorporated, yielding an input of shape $(P+1) \times d$ per image, where $P$ is the number of patches. This sequence is propagated through $L_v$ self-attention transformer encoder layers with multi-head attention, residual connections, and layer normalization. The post-transformer embedding of the $\text{CLS}_v$ token serves as the image-level representation $x_{ij}$ for visit $j$ of patient $i$.
- Sequence Encoder: Receives the temporal sequence of image embeddings $x_{i1}, \dots, x_{i n_i(s)}$ for each patient, covering all visits conducted up to landmark time $s$. With a learnable CLS token ($\text{CLS}_l$) appended, the model forms a sequence of length $n_i(s)+1$ that is input to an autoregressive transformer encoder ($L_l$ layers) with causal masking. Causal masking restricts attention to historical and current timepoints, preserving temporal order and preventing information leakage. The output $\text{CLS}_l$ embedding, denoted $z_i(s)$, summarizes the patient's longitudinal progression up to time $s$.
- Survival Encoder: Fuses the learned longitudinal sequence embedding $z_i(s)$ with scalar covariates $w_i$ (e.g., demographic or clinical features) via a one-hidden-layer feed-forward neural network (FFNN) with GELU activation. The output is a patient-specific risk score $r_i(s) = \text{FFNN}\big([z_i(s);\, w_i]\big)$, where $[\cdot\,;\cdot]$ denotes concatenation.
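The three modules above can be sketched in PyTorch as follows. All dimensions (image size, patch size, embedding width, head and layer counts) are illustrative placeholders, not the paper's reported hyperparameters:

```python
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Patchify an image, attach a learnable CLS_v token, return its embedding."""
    def __init__(self, img_size=64, patch=8, d=64, heads=4, layers=2):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.proj = nn.Linear(patch * patch, d)          # flatten + project each patch
        self.cls = nn.Parameter(torch.zeros(1, 1, d))    # learnable CLS_v token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, d))  # positional encodings
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.patch = patch

    def forward(self, img):                              # img: (B, H, W)
        B, _, _ = img.shape
        p = self.patch
        patches = img.unfold(1, p, p).unfold(2, p, p).reshape(B, -1, p * p)
        x = torch.cat([self.cls.expand(B, -1, -1), self.proj(patches)], dim=1) + self.pos
        return self.encoder(x)[:, 0]                     # CLS_v embedding: (B, d)

class SequenceEncoder(nn.Module):
    """Causally masked transformer over per-visit embeddings; returns CLS_l summary."""
    def __init__(self, d=64, heads=4, layers=2):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d))    # learnable CLS_l token
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, visits):                           # visits: (B, T, d)
        B, T, _ = visits.shape
        x = torch.cat([visits, self.cls.expand(B, -1, -1)], dim=1)
        # True above the diagonal = attention disallowed -> each position sees
        # only current and past visits; CLS_l (last position) sees everything
        mask = torch.triu(torch.ones(T + 1, T + 1, dtype=torch.bool), diagonal=1)
        return self.encoder(x, mask=mask)[:, -1]         # CLS_l embedding: (B, d)

class SurvivalEncoder(nn.Module):
    """One-hidden-layer FFNN with GELU fusing z_i(s) and covariates into a risk score."""
    def __init__(self, d=64, n_cov=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + n_cov, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, z, w):                             # z: (B, d), w: (B, n_cov)
        return self.net(torch.cat([z, w], dim=-1)).squeeze(-1)
```

Note the design choice in the sequence encoder: placing $\text{CLS}_l$ at the final position lets it attend to all visits even under the causal mask.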
2. Cox Proportional Hazards Integration
SurLonFormer implements dynamic survival prediction by embedding the Cox proportional hazards model within a neural framework:
- Hazard and Survival Functions: For subject $i$, the hazard rate at time $t$ is:
$$\lambda_i(t) = \lambda_0(t)\,\exp\{r_i(s)\}.$$
Here, $\lambda_0(t)$ is the nonparametric baseline hazard, while $r_i(s)$ is the learned risk score replacing the linear predictor in classical Cox models. The survival function is:
$$S_i(t) = \exp\{-\Lambda_0(t)\,\exp(r_i(s))\},$$
with $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ as the cumulative baseline hazard. Model optimization proceeds via maximization of the Cox partial likelihood:
$$L = \prod_{i:\,\delta_i = 1} \frac{\exp(r_i)}{\sum_{k \in R(t_i)} \exp(r_k)},$$
where $\delta_i$ denotes event ($1$) or censoring ($0$) and $R(t_i)$ is the risk set at event time $t_i$. Elastic Net regularization is included to counteract overfitting, especially under limited-data regimes.
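Maximizing the partial likelihood is equivalent to minimizing its negative log. A minimal NumPy sketch of that loss follows (ties broken arbitrarily and the Elastic Net penalty omitted for brevity); `risk` stands in for the network outputs $r_i$:

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative log Cox partial likelihood.

    risk  : (n,) learned risk scores r_i
    time  : (n,) observed event/censoring times
    event : (n,) 1 = event observed, 0 = censored
    """
    order = np.argsort(-time)                     # sort by descending time so that
    risk, event = risk[order], event[order]       # the risk set R(t_i) is a prefix
    # log of the running sum exp(r) over the risk set {k : t_k >= t_i}
    log_cumsum = np.logaddexp.accumulate(risk)
    # only uncensored subjects contribute a factor; censored ones enter via risk sets
    return -np.sum((risk - log_cumsum)[event == 1])
```

The sorting trick makes each subject's risk set a prefix of the sorted array, so the denominator sums are computed in one numerically stable `logaddexp` accumulation rather than an $O(n^2)$ loop.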
3. Handling of Censoring, Scalability, and Interpretability
- Censoring: SurLonFormer directly accommodates censored data through the Cox partial likelihood, leveraging risk sets determined at each failure time so that both uncensored and censored observations contribute appropriately to the likelihood without explicit modeling of the baseline hazard $\lambda_0(t)$.
- Scalability: Computational complexity for the vision and sequence encoders is dominated by the self-attention operations, scaling as
$$\mathcal{O}\big(L \cdot H \cdot n^2 \cdot d\big),$$
where $H$ is the number of attention heads, $L$ (i.e., $L_v$ and $L_l$) is the depth of the transformer blocks, $n$ is the input sequence length (patches or visits), and $d$ is the embedding dimension. Model expressivity and computational budget are balanced via these hyperparameters.
- Interpretability — Occlusion Sensitivity: To elucidate which imaging regions most strongly drive risk estimates, SurLonFormer employs occlusion sensitivity analysis. Each image is divided into non-overlapping regions; sequentially masking each region with a baseline value (e.g., a black patch) and measuring the absolute change in the predicted risk score quantifies that region's influence. Sensitivity maps overlaid on the original images reveal the spatial areas critical to the model's predictive decisions, enabling insight into disease biomarkers.
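The occlusion procedure is simple to implement. In the sketch below, `predict_risk` stands in for the full SurLonFormer forward pass on a single image (a simplification, since the actual risk score depends on the whole visit sequence and covariates):

```python
import numpy as np

def occlusion_map(predict_risk, img, region=8, baseline=0.0):
    """Occlusion sensitivity: mask each non-overlapping region with a baseline
    value and record the absolute change in the predicted risk score."""
    H, W = img.shape
    base = predict_risk(img)                      # unoccluded reference prediction
    sens = np.zeros((H // region, W // region))
    for i in range(H // region):
        for j in range(W // region):
            occluded = img.copy()
            occluded[i*region:(i+1)*region, j*region:(j+1)*region] = baseline
            sens[i, j] = abs(predict_risk(occluded) - base)
    return sens  # upsample and overlay on the image for visualization
```

Regions whose occlusion barely moves the risk score receive near-zero sensitivity, while regions the model relies on produce large deviations.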
4. Empirical Evaluation and Results
- Simulation Studies: Experiments on synthetic longitudinal imaging data characterized by non-smooth, spatially global features employed Frobenius inner product-based ground-truth risk scores and Cox-derived survival times. Benchmarked against FPCA-Cox, LoFPCA-Cox, and CNN-LSTM baselines, SurLonFormer demonstrated higher time-dependent AUC and C-index as well as lower Brier Scores, evidencing enhanced discrimination and calibration.
- Alzheimer’s Disease Application (ADNI): Evaluation on the ADNI dataset, involving MRI-based longitudinal tracking towards Alzheimer’s onset, further substantiated SurLonFormer’s performance. It surpassed FPCA-Cox, LoFPCA-Cox, and CNN-LSTM in AUC, C-index, and Brier Score. Occlusion analysis corroborated the model’s emphasis on brain regions (frontal and temporal lobes) associated with Alzheimer’s pathology. Dynamic survival prediction allowed recalibration of event probabilities over multiple landmark times (e.g., 12, 24, 48 months), demonstrating model utility in updating individualized prognosis as patients progress.
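The landmark-time recalibration described above corresponds to conditional survival under the Cox model: given survival to landmark $s$, $P(T > t \mid T > s) = \exp\{-(\Lambda_0(t) - \Lambda_0(s))\exp(r_i(s))\}$. A minimal sketch, with a hypothetical linear cumulative baseline hazard standing in for a Breslow-type estimate:

```python
import numpy as np

def conditional_survival(cum_hazard, s, t, risk):
    """P(T > t | T > s) under the Cox model.

    cum_hazard : callable, cumulative baseline hazard Lambda_0(.)
    s, t       : landmark and horizon times (t >= s)
    risk       : learned risk score r_i(s) at the landmark
    """
    return np.exp(-(cum_hazard(t) - cum_hazard(s)) * np.exp(risk))

# Hypothetical baseline Lambda_0(t) = 0.01 * t; months as the time unit
Lam = lambda t: 0.01 * t
p = conditional_survival(Lam, s=12, t=24, risk=0.5)
```

As the landmark $s$ advances and new visits update $r_i(s)$, re-evaluating this quantity yields the dynamically recalibrated individualized prognosis.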
5. Model Significance and Clinical Implications
SurLonFormer uniquely combines spatial feature extraction, temporal sequence modeling, and flexible risk estimation for survival analysis of high-dimensional medical data. By enabling joint learning with both imaging and structured data, it improves representation learning for complex disease trajectories. The incorporation of occlusion sensitivity facilitates model interpretability, critical for clinical trust and for the identification of disease-relevant biomarkers. Scalability features and principled handling of censored data suggest broad applicability in large-cohort, high-dimensional studies.
A plausible implication is extension to other modalities (e.g., 3D imaging, multimodal fusion) and broader disease contexts, given SurLonFormer’s architecture generalizes across longitudinal, high-dimensional prognostic tasks.
6. Conclusion
SurLonFormer constitutes a significant methodological advance for survival analysis using longitudinal imaging. Through its integrated transformer-based architecture, adherence to the Cox model’s statistical foundations, and robust interpretability mechanisms, it achieves state-of-the-art predictive performance and spatial biomarker identification in both simulation and real-world clinical settings. Its design and results underscore the growing capacity of deep learning models to address the complexities of censored, multi-visit, high-dimensional medical data in dynamic clinical risk modeling (Liu et al., 12 Aug 2025).