NLST-cmst: Longitudinal Pulmonary Nodule Dataset
- NLST-cmst is a curated, longitudinal pulmonary nodule dataset integrating temporal 3D CT imaging and clinical covariates for dynamic malignancy analysis.
- It features standardized ROI extraction, robust preprocessing, and both expert and AI-driven annotations to ensure reproducibility in model benchmarking.
- The dataset underpins advanced spatiotemporal modeling and graph-based multimodal fusion architectures, setting a new benchmark in pulmonary oncology research.
The NLST-cmst dataset is a rigorously annotated, longitudinal, multimodal pulmonary nodule dataset derived from the National Lung Screening Trial (NLST), designed to benchmark machine learning models for dynamic prediction of nodule malignancy. It offers standardized 3D CT scan sequences and detailed clinical covariates for each subject, incorporating robust preprocessing, harmonized temporal acquisition, and both radiologist- and AI-driven annotations. NLST-cmst is deployed as a reproducible platform for evaluating spatiotemporal deep learning architectures that fuse image and clinical data in pulmonary oncology research.
1. Origin and Construction
The NLST-cmst dataset is built upon the large-scale NLST trial cohort, which enrolled U.S. adults at high risk for lung cancer and conducted annual low-dose chest CT screening. From this source population, subsets of subjects are selected on the basis of available longitudinal CT volumes (minimum two time points: baseline and at least one follow-up), paired with definitive pathologic diagnosis (benign/malignant outcome) established during clinical follow-up (Yu et al., 24 Dec 2025, Shen et al., 27 Jan 2025).
Selection pipelines differ slightly by study reporting on NLST-cmst. Typical criteria are:
- A minimum of two high-quality, low-dose CT acquisitions per subject.
- Pathology-verified nodule malignancy status (either subsequent cancer or negative follow-up across the study window).
- Availability of standardized clinical covariates: age, gender, smoking history, and NLST screen result.
Professional thoracic radiologists manually identify and crop a dominant pulmonary nodule’s 3D region-of-interest (RoI) from each per-subject CT series. The RoI is centered on the nodule and standardized to a 16 × 64 × 64 voxel volume (Yu et al., 24 Dec 2025, Shen et al., 27 Jan 2025). Imaging parameters (slice thickness, in-plane resolution) are not always reported in the literature, but NLST standard protocols specify a slice thickness of approximately 1.0–1.25 mm and an in-plane resolution of ~0.5 mm × 0.5 mm.
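The fixed-size RoI extraction described above can be sketched as a center-crop with zero-padding at volume borders. This is a minimal sketch, assuming a (depth, height, width) voxel array; function and variable names are illustrative, not taken from the papers:

```python
import numpy as np

def crop_roi(volume, center, shape=(16, 64, 64)):
    """Center-crop a CT volume around a nodule, zero-padding where the
    requested window extends past the volume borders.

    volume : 3D array in (depth, height, width) order
    center : nodule center as (z, y, x) voxel indices
    shape  : output RoI size, default 16 x 64 x 64 as in NLST-cmst
    """
    out = np.zeros(shape, dtype=volume.dtype)
    src, dst = [], []
    for ax in range(3):
        half = shape[ax] // 2
        start = center[ax] - half           # window start, may be negative
        stop = start + shape[ax]            # window end, may exceed volume
        s0 = max(start, 0)                  # clamped source slice
        s1 = min(stop, volume.shape[ax])
        d0 = s0 - start                     # destination offset in output
        d1 = d0 + (s1 - s0)
        src.append(slice(s0, s1))
        dst.append(slice(d0, d1))
    out[tuple(dst)] = volume[tuple(src)]
    return out
```

A nodule centered well inside the scan fills the full 16 × 64 × 64 window, while one near a border yields a partially zero-padded RoI.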
Subject demographics in published splits are balanced by gender and reflect screened adults (predominantly ages 55–75). All included nodules are pathologically confirmed as benign or malignant, with downstream train/test splits preserving the benign-to-malignant ratio.
2. Multimodal and Temporal Structure
NLST-cmst is explicitly multimodal and multitemporal. For each subject:
- CT Imaging: At each of T ≥ 2 time points, a 3D RoI of 16 × 64 × 64 voxels is extracted after uniform resampling. Inter-scan intervals are typically ≈1 year, although precise scheduling varies individually (Yu et al., 24 Dec 2025).
- Clinical Covariates: Age, gender, smoking status, and final screen result are retained and encoded as a feature vector, which is embedded via a multi-layer perceptron for machine learning use (Yu et al., 24 Dec 2025).
- Dataset Sizes: Reports enumerate 433 subjects in the main DGSAN study (Yu et al., 24 Dec 2025), 443 in CSF-Net (Shen et al., 27 Jan 2025), and up to 571 in the AI-enriched annotation resource (Krishnaswamy et al., 2023). Each subject generally contributes one tracked nodule.
- Label Assignment: Pathology is the gold standard; non-malignant designations require negative follow-up throughout the NLST observation period.
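The per-subject structure above can be illustrated with a hypothetical record builder. Shapes follow the text (T ≥ 2 RoIs of 16 × 64 × 64 voxels plus four covariates); the numeric covariate encoding is an assumption for illustration only:

```python
import numpy as np

def make_subject(n_timepoints=2, rng=None):
    """Build one synthetic NLST-cmst-style record: longitudinal CT RoIs,
    a clinical covariate vector, and a pathology-verified label."""
    rng = np.random.default_rng(rng)
    return {
        # T stacked RoIs, each 16 x 64 x 64 voxels
        "ct": rng.standard_normal((n_timepoints, 16, 64, 64)).astype(np.float32),
        # age, gender, smoking status, screen result (encoding is illustrative)
        "clinical": np.array([61.0, 1.0, 1.0, 0.0], dtype=np.float32),
        # 1 = malignant (pathology-confirmed), 0 = benign with negative follow-up
        "label": 1,
    }
```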
3. Preprocessing, Annotation, and Feature Engineering
3.1. Imaging Preprocessing
- Voxel Resampling: All volumes are resampled to isotropic or protocol-defined uniform voxel spacing.
- RoI Cropping: Fixed-size RoIs are extracted, centered on the nodule.
- Intensity Normalization: Hounsfield units within RoIs are clipped to the standard “lung window” and then z-score normalized as x̂ = (x − μ) / σ, where μ and σ are the global mean and standard deviation over all training RoIs (Yu et al., 24 Dec 2025).
- Annotation: Manual (expert) segmentation is detailed in CSF-Net and DGSAN; the AI-annotated NLST-cmst adds automated thoracic organ segmentations (nnU-Net), slice-level anatomical landmarks (Body Part Regression), and PyRadiomics-derived 3D shape features, all encoded in DICOM SEG and SR objects (Krishnaswamy et al., 2023).
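The clipping and z-score steps from 3.1 can be sketched as follows. The exact window bounds are not stated in the papers, so the commonly used −1000 to 400 HU lung window is assumed here:

```python
import numpy as np

# Assumed lung-window bounds; the papers say only "the standard lung window".
LUNG_WINDOW = (-1000.0, 400.0)

def training_stats(train_rois, window=LUNG_WINDOW):
    """Global mean/std over all (clipped) training RoIs."""
    stacked = np.clip(np.stack(train_rois), *window)
    return float(stacked.mean()), float(stacked.std())

def normalize_roi(roi_hu, mean, std, window=LUNG_WINDOW):
    """Clip HU values to the lung window, then z-score with
    statistics computed on the training partition only."""
    clipped = np.clip(roi_hu, *window)
    return (clipped - mean) / std
```

Computing μ and σ on the training partition only (and reusing them at test time) avoids leaking test-set statistics into normalization.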
3.2. Feature Engineering
- Global-Local Feature Encoding: At each time point t, three feature sets are derived:
  - Local features f_t^loc (texture/edge detail)
  - Global features f_t^glob (context, Swin Transformer-style attention)
  - Fused features f_t (adaptive fusion block combining f_t^loc and f_t^glob).
- Clinical Feature Embedding: Tabular covariates are concatenated into a vector c, then mapped to an embedding e_c via a multi-layer perceptron.
- In AI-derived expansions: 3D shape features (e.g., sphericity, elongation) are computed per thoracic organ mask.
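The clinical-branch embedding (tabular covariates mapped through an MLP) might look like the sketch below; the layer widths, covariate encoding, and weight initialization are illustrative assumptions, not the papers’ configuration:

```python
import numpy as np

def mlp_embed(clinical, w1, b1, w2, b2):
    """Two-layer MLP mapping a tabular covariate vector to an embedding
    (a sketch of the clinical-branch encoder; sizes are illustrative)."""
    h = np.maximum(clinical @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

rng = np.random.default_rng(0)
c = np.array([61.0, 1.0, 1.0, 0.0])          # age, gender, smoking, screen result
w1, b1 = rng.standard_normal((4, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.standard_normal((32, 64)) * 0.1, np.zeros(64)
e_c = mlp_embed(c, w1, b1, w2, b2)           # 64-d clinical embedding
```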
4. Data Layout, Splits, and Benchmarking
- Train/Test/CV Splits: DGSAN’s cohort is partitioned with 80% training (347 subjects), 20% test (86 subjects), and 5-fold cross-validation within the training partition (Yu et al., 24 Dec 2025). Other studies (e.g., CSF-Net (Shen et al., 27 Jan 2025)) leave split details unspecified.
- Benchmarking Tasks:
- Primary: Dynamic malignancy prediction, integrating temporal imaging and clinical data.
- Secondary: Survival analysis (with radiomics features), anatomical structure segmentation, and computational pathology.
- Baseline Models: Extensive benchmarking is reported in the CSF-Net study, with performance metrics (accuracy, F1, AUC, recall, precision) tabulated below:
| Method | Accuracy | Precision | F1 | AUC | Recall |
|---|---|---|---|---|---|
| SCANs | 0.7865 | 0.7667 | 0.7077 | 0.7725 | 0.6571 |
| NAS-Lung | 0.8539 | 0.8235 | 0.8116 | 0.8910 | 0.8000 |
| T-LSTM | 0.7645 | 0.7012 | 0.6527 | 0.7778 | 0.6000 |
| DeepCAD | 0.8590 | 0.7879 | 0.8254 | 0.8990 | 0.8667 |
| MFCN | 0.7949 | 0.7059 | 0.7500 | 0.8903 | 0.8000 |
| RadFusion | 0.7753 | 0.8026 | 0.6667 | 0.7693 | 0.6000 |
| CSF-Net | 0.8974 | 0.8235 | 0.8750 | 0.9389 | 0.9333 |
This demonstrates substantial benefit from multimodal, longitudinal modeling.
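A stratified 80/20 split that preserves the benign/malignant ratio, as in the DGSAN protocol, can be sketched as below (5-fold cross-validation would partition the resulting training indices the same way); the function is a generic sketch, not the papers’ released code:

```python
import numpy as np

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split subject indices into train/test sets, preserving the
    per-class (benign/malignant) proportions in each partition."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_frac))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(train_idx), np.sort(test_idx)
```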
5. Graph-Based Multimodal Fusion Architecture
A distinguishing feature of NLST-cmst protocols—especially in the DGSAN study—is formal graph-structured multimodal fusion:
- Node Set: local and global CT feature nodes from the two time points, plus the clinical embedding node.
- Graphs:
- Intra-modal graph: Fully connects local/global CT nodes.
- Inter-modal graph: Links every CT node to the clinical node (bidirectionally).
- Adjacency Structure: A = I + A_fix + A_learn, where I encodes self-loops, A_fix is a fixed diagonal component, and A_learn is learnable.
- Graph Attention Aggregation: A GAT layer computes
  h_i′ = σ( Σ_{j ∈ N(i)} α_ij W h_j ),  α_ij = softmax_j( LeakyReLU( aᵀ [W h_i ∥ W h_j] ) ),
  with W and a as learned projections. This architecture enables explicit reasoning over both intra- and inter-modal feature relationships (Yu et al., 24 Dec 2025).
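The graph-attention aggregation can be sketched generically as follows. This is a standard GAT formulation over the small multimodal node set; the paper’s exact parameterization, adjacency weighting, and nonlinearities may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(H, A, W, a):
    """One graph-attention aggregation step.

    H : (N, d) node features   A : (N, N) adjacency (nonzero = edge)
    W : (d, d') projection     a : (2*d',) attention vector
    """
    Z = H @ W
    out = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        nbrs = np.flatnonzero(A[i])
        # attention logits over concatenated (source, neighbor) projections
        pair = np.concatenate([np.tile(Z[i], (len(nbrs), 1)), Z[nbrs]], axis=1)
        scores = pair @ a
        scores = np.maximum(scores, 0.2 * scores)  # LeakyReLU
        alpha = softmax(scores)                    # normalized attention
        out[i] = alpha @ Z[nbrs]                   # weighted neighbor sum
    return out

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 8))   # e.g. CT node t1, CT node t2, clinical node
A = np.ones((3, 3))               # intra-/inter-modal edges plus self-loops
W = rng.standard_normal((8, 8)) * 0.1
a = rng.standard_normal(16) * 0.1
out = gat_layer(H, A, W, a)
```

With all-ones adjacency, every CT node attends to both the other CT node and the clinical node, mirroring the intra-/inter-modal graph structure described above.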
6. Licensing, Access, and Best Practices
- Data Use: NLST-cmst is released under the US National Cancer Institute’s NLST data-use agreement, requiring application to the NCI and explicit adherence to patient-privacy provisions (Yu et al., 24 Dec 2025).
- Distribution: AI-annotated variants (with DICOM SEG and SR) are hosted in the NCI Imaging Data Commons (IDC), accessible via guided BigQuery queries, Zenodo DOIs, and cloud-enabled notebooks (Krishnaswamy et al., 2023).
- Code and Reproducibility: Full preprocessing, feature encoding, graph-fusion, and model training pipelines (e.g., for DGSAN) are available in open-source repositories (Yu et al., 24 Dec 2025).
- Best Practices: Recommendations include harmonizing CT acquisition parameters (voxel spacing, lung-windowing), validating RoI annotations, standardizing follow-up intervals, and systematic quality control on AI-segmentations (via sphericity, compactness outlier plots; visual inspection in OHIF) (Krishnaswamy et al., 2023).
7. Scientific Impact and Context
NLST-cmst fills a crucial gap among open pulmonary nodule datasets:
- Longitudinality: Enables dynamic modeling of nodule evolution, unavailable in cross-sectional resources such as LIDC-IDRI.
- Clinical Integration: Pairs imaging sequences with covariates and pathology-verified outcomes, allowing cross-modal fusion and survival modeling.
- Benchmarking: Has established itself as a standard for evaluating multimodal, temporal deep learning approaches, with reproducible benchmarks for diverse architectures (Yu et al., 24 Dec 2025, Shen et al., 27 Jan 2025).
- Interoperability: AI-derived anatomical and radiomics annotations, encoded in standard DICOM, ensure compatibility with clinical and research pipelines (Krishnaswamy et al., 2023).
A plausible implication is that NLST-cmst will support future work in prognostic modeling, trajectory prediction, and radiogenomic studies by providing a harmonized, high-quality substrate for multimodal analysis. The dataset’s integration of temporal imaging and clinical information reflects a shift toward more comprehensive risk stratification frameworks in computational oncology.