
Lung-PET-CT-Dx: Multimodal Lung Imaging Dataset

Updated 1 December 2025
  • Lung-PET-CT-Dx is a multimodal imaging dataset combining PET, CT scans, and clinical metadata for lung tumor segmentation and subtype classification.
  • It offers two configurations—a classification cohort with structured EHR data and a segmentation cohort with expert-annotated NIfTI images—catering to distinct research objectives.
  • Documented preprocessing protocols, augmentation strategies, and standardized evaluation metrics support reproducible evaluation of deep learning models under domain shift.

The Lung-PET-CT-Dx dataset is a publicly available, multimodal medical imaging resource specifically designed to facilitate the development and evaluation of deep learning models in lung tumor segmentation and classification using positron emission tomography (PET), computed tomography (CT), and structured clinical electronic health record (EHR) data. Distinguished by its coupling of volumetric molecular imaging with rich patient metadata and rigorous expert annotation, Lung-PET-CT-Dx supports research on both tumor subtyping and cross-domain segmentation generalization. Multiple publications reference distinct versions and subsets of the resource, each emphasizing different properties, preprocessing protocols, and research applications (Yu et al., 6 Aug 2025, Ghosh et al., 26 Aug 2025).

1. Dataset Composition and Structure

Lung-PET-CT-Dx exists in two main configurations as documented in the literature:

  • Classification configuration (Yu et al., 6 Aug 2025):
    • 355 patients, each with at least one volumetric CT or PET–CT scan.
    • Diagnostic labels used for classification are restricted to two malignant subtypes: 251 adenocarcinoma and 61 squamous cell carcinoma cases, i.e. 312 of the 355 patients; no benign controls are included.
    • Associated structured data fields: gender, age at diagnosis, weight, TNM stage, and smoking history.
    • PET–CT studies in DICOM format; subsequent processing aggregates acquisitions as 192 × 192 × 12 stacks for 3D input.
  • Segmentation configuration (Ghosh et al., 26 Aug 2025):
    • 54 Indian patients (median age ~65, 30 males, 24 females) with biopsy-proven non-small-cell lung cancer (both adenocarcinoma and squamous cell carcinoma).
    • All cases feature comprehensive clinical metadata (age, sex, histologic subtype, TNM stage).
    • Processed as resampled isotropic (1 × 1 × 1 mm³) PET and CT volumes, distributed in NIfTI format: each folder contains CT.nii.gz, PET.nii.gz, and seg.nii.gz.

This bifurcated structure enables both disease classification (subtype prediction) and boundary-based tumor segmentation tasks.
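
As a rough illustration of the segmentation layout described above, the following sketch indexes case folders and flags any that lack one of the three expected files. Only the file names CT.nii.gz, PET.nii.gz, and seg.nii.gz come from the description; the root path, the folder naming, and the helper name `index_cases` are assumptions made for the example.

```python
from pathlib import Path

# File names taken from the dataset description; everything else is illustrative.
EXPECTED_FILES = ("CT.nii.gz", "PET.nii.gz", "seg.nii.gz")


def index_cases(root):
    """Return (complete, incomplete) dictionaries keyed by case folder name.

    complete maps a case name to {file name: path}; incomplete maps a case
    name to the list of expected files that were not found.
    """
    complete, incomplete = {}, {}
    case_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for case_dir in case_dirs:
        missing = [name for name in EXPECTED_FILES
                   if not (case_dir / name).is_file()]
        if missing:
            incomplete[case_dir.name] = missing
        else:
            complete[case_dir.name] = {name: case_dir / name
                                       for name in EXPECTED_FILES}
    return complete, incomplete
```

Running such a check before inference makes it easier to confirm that all 54 cases are present and complete before any metric is reported.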

2. Imaging Acquisition Protocols and Preprocessing

Data collection protocols vary across studies but conform to established oncologic imaging standards:

  • Imaging Modality: Whole-body PET–CT using ¹⁸F-fluorodeoxyglucose (¹⁸F-FDG) as the radiotracer. Clinical protocols for scan timing and preparation are followed; scanner make/model details are variably reported.
  • Reconstruction:
    • CT: Images reconstructed as 512 × 512 axial matrices; resampled to isotropic 1 mm³ voxels.
    • PET: Standardized to match CT resolution and voxel grid.
  • Intensity Normalization:
    • CT intensities clipped to [-1000, 1000] Hounsfield Units (HU), then normalized (segmentation cohort) or standardized to zero mean and unit variance (classification cohort).
    • PET intensities (Standardized Uptake Values, SUV) scaled to a common range per cohort.
  • Preprocessing Pipelines:
    • Segmentation: Uniform isotropic resampling, intensity normalization, and augmentation (rotations, scaling, elastic deformation, gamma correction) mirroring nnU-Net protocols.
    • Classification: Slices rescaled to 192 × 192 pixels, grouped into 12-slice volumes, with random rotations (±10°) and sharpening for augmentation.
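
The CT intensity handling listed above can be sketched with NumPy as follows. The clipping window of [-1000, 1000] HU and the zero-mean/unit-variance standardization come from the description; the min-max mapping to [0, 1] for the segmentation-style variant and the choice to clip before standardizing are assumptions, since the exact normalization is not specified.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 1000.0  # clipping window from the description


def clip_ct(ct_hu):
    """Clip CT intensities to the [-1000, 1000] HU window."""
    return np.clip(np.asarray(ct_hu, dtype=np.float32), HU_MIN, HU_MAX)


def minmax_ct(ct_hu):
    """Assumed segmentation-style variant: clipped HU mapped linearly to [0, 1]."""
    return (clip_ct(ct_hu) - HU_MIN) / (HU_MAX - HU_MIN)


def zscore(volume, eps=1e-8):
    """Classification-style variant: zero mean and unit variance per volume."""
    v = np.asarray(volume, dtype=np.float32)
    return (v - v.mean()) / (v.std() + eps)
```

For example, `zscore(clip_ct(volume))` would standardize a clipped CT volume, while `minmax_ct(volume)` yields values in [0, 1].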

No explicit deformable inter-subject registration is performed; possible domain shifts due to acquisition heterogeneity remain (Yu et al., 6 Aug 2025).

3. Annotation Standards and Quality Assurance

Expert annotation protocols are dataset-specific:

  • Segmentation (Ghosh et al., 26 Aug 2025):
    • Manual binary segmentation by a board-certified nuclear medicine physician using ITK-SNAP.
    • Masks represent only metabolically active primary tumor regions as assessed by fused PET–CT.
    • No annotations for background, benign lesions, or adjacent structures.
    • Single-reader annotation; thus, inter-observer variability metrics are not reported.
  • Classification (Yu et al., 6 Aug 2025):
    • Labels assigned per patient based on confirmed histopathology (adenocarcinoma, squamous cell carcinoma).
    • EHR-derived metadata fields aggregated and cleaned.

The segmentation dataset serves as a held-out external test set for “stress-testing” cross-domain generalization, while the classification cohort uses oversampling and train/validation/test splits to address class imbalance.

4. Partitioning and Experimental Protocols

Partitioning and training strategies reflect the dataset’s intended benchmarking role:

  • Segmentation (Ghosh et al., 26 Aug 2025):
    • All 54 cases reserved as an external test set; no training or validation is performed on this subset in cross-cohort studies.
    • Standardized folder structure enables seamless nnU-Net pipeline integration.
  • Classification (Yu et al., 6 Aug 2025):
    • Train/validation/test splits are stratified by subtype:
      • Adenocarcinoma: majority allocated for training, with 12 validation and 15 test cases.
      • Squamous cell carcinoma: 34 train, 12 validation, 15 test; random oversampling applied to training set.
    • Held-out test set consists of 30 patients (15 of each subtype), used for all primary performance reporting.

Augmentation is performed systematically for imaging data but omitted for tabular or categorical EHR data.
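
A minimal sketch of a per-subtype split with random oversampling of the training set, under the assumption that a fixed number of validation (12) and test (15) cases is drawn from each subtype and that the remainder goes to training. The function names, the seed handling, and the patient ID format are hypothetical; the published split itself is not reproduced here.

```python
import random


def stratified_split(labels, n_val=12, n_test=15, seed=0):
    """Split patient IDs per subtype into train/val/test lists.

    labels maps patient ID -> subtype. For each subtype, n_test and n_val
    patients are held out and the rest are assigned to training.
    """
    rng = random.Random(seed)
    by_class = {}
    for pid, subtype in labels.items():
        by_class.setdefault(subtype, []).append(pid)

    splits = {"train": [], "val": [], "test": []}
    for subtype in sorted(by_class):
        pids = sorted(by_class[subtype])
        rng.shuffle(pids)
        splits["test"] += pids[:n_test]
        splits["val"] += pids[n_test:n_test + n_val]
        splits["train"] += pids[n_test + n_val:]
    return splits


def oversample(train_ids, labels, seed=0):
    """Randomly duplicate minority-class training IDs until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for pid in train_ids:
        by_class.setdefault(labels[pid], []).append(pid)
    target = max(len(pids) for pids in by_class.values())

    balanced = []
    for subtype in sorted(by_class):
        pids = by_class[subtype]
        extra = [rng.choice(pids) for _ in range(target - len(pids))]
        balanced += pids + extra
    return balanced
```

With 61 squamous cell carcinoma cases this yields 34 training cases, matching the counts above. Oversampling is applied to the training list only, so validation and test sets keep their original composition.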

5. Evaluation Metrics and Reporting

Performance evaluation is standardized and comprehensive:

  • Segmentation Metrics (Ghosh et al., 26 Aug 2025):
    • Dice Similarity Coefficient (DSC):

    $$DSC = \frac{2 \, |P \cap G|}{|P| + |G|}$$

    where $P$ is the predicted mask and $G$ the ground truth.
    • 95th-Percentile Hausdorff Distance ($HD_{95}$): Quantifies the boundary agreement at the 95th percentile of contour distances.
    • Precision and recall also reported.

  • Classification Metrics (Yu et al., 6 Aug 2025):

    • Accuracy (ACC): $(TP + TN) / (TP + TN + FP + FN)$
    • Sensitivity (SE): $TP / (TP + FN)$
    • Specificity (SP): $TN / (TN + FP)$
    • Positive Predictive Value (PPV): $TP / (TP + FP)$
    • Negative Predictive Value (NPV): $TN / (TN + FN)$
    • F1-score: $2\cdot(\text{Precision}\cdot\text{Recall})/(\text{Precision}+\text{Recall})$
    • Area Under the ROC Curve (AUROC).
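
The count-based classification metrics above can be computed directly from a binary confusion matrix, as in this small sketch. The function name `binary_metrics` and the choice to return NaN for undefined ratios are assumptions; precision is taken to be PPV and recall to be sensitivity. AUROC is omitted because it needs continuous prediction scores rather than counts.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, SE, SP, PPV, NPV, and F1 from confusion-matrix counts."""

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    se = ratio(tp, tp + fn)    # sensitivity, i.e. recall
    ppv = ratio(tp, tp + fp)   # positive predictive value, i.e. precision
    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "SE": se,
        "SP": ratio(tn, tn + fp),
        "PPV": ppv,
        "NPV": ratio(tn, tn + fn),
        "F1": ratio(2 * ppv * se, ppv + se),
    }
```

Which subtype counts as the positive class is a reporting choice and should be stated alongside the results.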

Ground-truth and evaluation scripts typically match nnU-Net and MMCAF-Net benchmark conventions.
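
For the voxel-overlap segmentation metrics, a minimal NumPy sketch is given below. It is not the evaluation script used in the cited work; the function name `overlap_metrics` and the NaN convention for empty masks are assumptions. The 95th-percentile Hausdorff distance is left out because it requires surface-distance computation beyond simple voxel counting.

```python
import numpy as np


def overlap_metrics(pred, gt):
    """Dice, precision, and recall for two binary masks of equal shape."""
    p = np.asarray(pred).astype(bool)
    g = np.asarray(gt).astype(bool)
    if p.shape != g.shape:
        raise ValueError("prediction and ground truth must have the same shape")

    tp = int(np.logical_and(p, g).sum())   # voxels in both masks, |P ∩ G|
    n_pred, n_gt = int(p.sum()), int(g.sum())
    nan = float("nan")
    return {
        "dice": 2 * tp / (n_pred + n_gt) if (n_pred + n_gt) else nan,
        "precision": tp / n_pred if n_pred else nan,
        "recall": tp / n_gt if n_gt else nan,
    }
```

Because both masks must share one voxel grid, predictions should be resampled to the ground-truth grid before the metrics are computed.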

6. Limitations, Strengths, and Future Directions

The Lung-PET-CT-Dx dataset exhibits both notable strengths and known challenges:

  • Strengths:
    • Multimodal nature: Volumetric PET–CT coupled with detailed clinical metadata enables complex phenotype modeling and multimodal fusion strategies.
    • Public, de-identified release supports reproducible research and cross-institutional benchmarking.
    • Domain diversity: Contrasts with other public datasets (e.g., AutoPET), allowing for stress-testing of generalization across sites, populations, and scanner platforms.
    • Data organization: Standardized NIfTI folder structure, isotropic resampling, and explicit external test set protocols facilitate integration into modern deep learning pipelines.
  • Limitations:
    • Absence of benign cases and control patients precludes full spectrum disease identification tasks; classification restricted to malignant subtypes.
    • Severe class imbalance: squamous cell carcinoma is under-represented relative to adenocarcinoma (61 vs. 251 cases), which leads to unstable F1 scores despite oversampling.
    • Variability in acquisition parameters across cohorts introduces potential domain shifts; only resampling and intensity normalization are applied, with no deformable registration.
    • Segmentation annotations are single-reader for the lung cohort, limiting inter-observer validation.
  • Implications and Future Work:
    • Expansion to multi-class (benign/malignant/multihistology) with further pathological labels.
    • Harmonization across institutions and acquisition protocols may reduce cross-site domain effects; deformable registration or domain-adversarial training are potential avenues.
    • Enhanced evaluation of annotation variability, as only single-expert contours are presently available in the lung segmentation subset.
    • Validated approaches in Lung-PET-CT-Dx, such as 3D attention-based multimodal fusion (Yu et al., 6 Aug 2025), motivate research into efficient deep architectures leveraging coupled imaging and structured data.

7. Accessibility and Use Recommendations

  • The dataset is publicly hosted following a non-commercial research agreement; no access restrictions beyond institutional data-use terms (Ghosh et al., 26 Aug 2025).
  • Lung-PET-CT-Dx is recommended as a held-out external benchmark for empirical model validation, especially in conjunction with larger resources like AutoPET.
  • Best practices include strict train/val/test separation, data augmentation aligned to the chosen pipeline, and exclusion from all model selection stages when using the dataset as an external test set.
  • Researchers are encouraged to maintain harmonized preprocessing pipelines to control for inter-cohort domain shift and to adopt robust reporting standards for both segmentation and classification use cases.
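
One simple way to enforce the strict split separation recommended above is to check that no patient ID appears in more than one split. The sketch below assumes splits are available as collections of patient IDs; the function name and the ID format are hypothetical.

```python
def find_split_leakage(splits):
    """Return {(split_a, split_b): [shared IDs]} for every pair that overlaps.

    splits maps a split name (e.g. "train") to an iterable of patient IDs.
    An empty result means the splits are disjoint at the patient level.
    """
    names = sorted(splits)
    leaks = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(splits[first]) & set(splits[second])
            if shared:
                leaks[(first, second)] = sorted(shared)
    return leaks
```

The same check applies when the dataset is used as an external test set: its patient IDs can be passed as one split alongside the training and validation cohorts.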

References:

  • "Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification" (Yu et al., 6 Aug 2025)
  • "Stress-testing cross-cancer generalizability of 3D nnU-Net for PET-CT tumor segmentation: multi-cohort evaluation with novel oesophageal and lung cancer datasets" (Ghosh et al., 26 Aug 2025)