CT-3M Dataset: COVID-19 & 3D Matting Benchmarks
- CT-3M dataset is a dual benchmark combining a large-scale COVID-19 detection CT collection with a high-quality 3D soft nodule segmentation dataset.
- The COVIDx CT-3 subset offers 431,205 CT slices from 6,068 patients across 17+ countries, featuring expert-labeled data and standardized protocols.
- The CT-3M 3D Medical Matting Dataset provides detailed lung nodule annotations with soft segmentation, enabling improved radiomics and treatment planning.
The CT-3M designation refers to two distinct, open-source computed tomography (CT) datasets, each serving as a benchmark in a different domain of medical image analysis: COVIDx CT-3 (focused on COVID-19 detection) and the CT-3M 3D Medical Matting Dataset (focused on soft segmentation of lung nodules). Both datasets are multinational, curated to address specific limitations in their respective fields, and released for academic research and the development of machine learning applications.
1. Dataset Definitions and Intended Use
COVIDx CT-3 ("CT-3M" in some literature) is a large-scale multinational benchmark for supervised COVID-19 case detection from chest CT, designed to maximize coverage in both case diversity and imaging parameters. It contains 431,205 axial CT slices from 6,068 patients and is curated for the development and validation of machine learning systems for CT-based COVID-19 screening, facilitating model benchmarking and comparison under consistent protocols.
CT-3M 3D Medical Matting Dataset is the first high-quality annotated medical imaging dataset for 3D matting, containing soft (alpha-matte) and binary nodule segmentations. Its primary use is the development and evaluation of algorithms for soft segmentation of lung nodules, particularly those with ambiguous or fuzzy boundaries (e.g., ground-glass nodules), supporting downstream applications in radiomics, malignancy classification, and radiotherapy planning.
2. Data Collection, Structure, and Preprocessing
COVIDx CT-3
- Source and Diversity: Data is aggregated from 17+ countries via 11 public initiatives (notably CNCB, ITAC/TCIA, LIDC-IDRI, MosMedData). Most patients are from China (42.2%), France (19.4%), Russia (12.5%), Iran (11.8%), with the remainder from the US, Australia, Algeria, and Italy.
- Volume and Class Composition: 431,205 2D slices from 6,068 patients distributed as 16.6% normal (control), 10.0% CAP (other pneumonia), and 73.4% COVID-19.
- Acquisition Protocols: Imaging parameters vary because the data are pooled from multiple institutions; slice thickness is ~1–5 mm, the in-plane matrix size is 512×512, and tube voltage/current and reconstruction kernels depend on the site.
- Labeling: Ground truth reflects patient-level RT-PCR/clinical diagnosis, mapped to slices. Slices without explicit labels leverage segmentation masks, non-expert labeling, or pre-trained model-based inference.
- Preprocessing: Slices are intensity-windowed to [–1200, +600] HU, linearly normalized to [0, 1], and resized or cropped to 224×224 pixels, with no further denoising.
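The slice preprocessing described above can be sketched as follows. This is a minimal illustration rather than the dataset authors' pipeline: the function name `preprocess_slice` is hypothetical, and nearest-neighbour resizing is an assumption, since the text does not name the interpolation method or say when cropping is used instead of resizing.

```python
import numpy as np

HU_MIN, HU_MAX = -1200.0, 600.0  # intensity window stated in the text
TARGET = 224                      # output side length stated in the text


def preprocess_slice(slice_hu: np.ndarray) -> np.ndarray:
    """Window, normalize, and resize one 2D CT slice given in Hounsfield units."""
    # Intensity windowing: clamp values to the stated HU range.
    windowed = np.clip(slice_hu.astype(np.float32), HU_MIN, HU_MAX)
    # Linear normalization of the window to [0, 1].
    normalized = (windowed - HU_MIN) / (HU_MAX - HU_MIN)
    # Nearest-neighbour resize (assumption: interpolation is not specified).
    rows = np.arange(TARGET) * normalized.shape[0] // TARGET
    cols = np.arange(TARGET) * normalized.shape[1] // TARGET
    return normalized[np.ix_(rows, cols)]
```

A 512×512 input yields a 224×224 array with values in [0, 1]; in practice a smoother interpolation would likely be preferred over nearest-neighbour sampling.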
CT-3M 3D Medical Matting Dataset
- Source: All data are derived from LIDC-IDRI, encompassing 542 patients with 864 annotated nodule volumes (one cropped 3D patch per nodule).
- Acquisition: Thoracic CT, non-contrast, slice thickness 1–3 mm, in-plane resolution ~0.5–0.7 mm, multi-vendor (GE, Siemens, Philips).
- Inclusion/Exclusion: Nodules ≥3 mm annotated by four radiologists are included. Nodules for which none of the four matting algorithms yields an acceptable alpha matte are excluded.
- Annotation Pipeline: Four binary masks per nodule form a 3D trimap (foreground = intersection, background = outside union, unknown = dilated rest). Four matting algorithms—CF+, KNN+, LB+, IF+—extended to 3D produce soft segmentations. An expert selects the most plausible alpha matte per nodule.
- Data Format: All scans and masks are NIfTI; naming retains patient/nodule identity. Each nodule has: image, alpha-matte (soft), union mask, overlap mask (binary).
- Preprocessing: Recommended to clip to [–1000, 400] HU, min–max normalize [0,1], resample to isotropic 0.5 mm, and crop 128×128×N around nodules.
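The trimap rule in the annotation pipeline (foreground = intersection of the four masks, background = outside their union, unknown = the dilated remainder) can be sketched as below. The label values, the 6-connected one-voxel dilation, the `dilation_iters` parameter, and the choice to let the dilated unknown band overwrite neighbouring foreground and background voxels are all assumptions made for illustration; the source does not specify them.

```python
import numpy as np

FG, BG, UNKNOWN = 1.0, 0.0, 0.5  # hypothetical label values


def dilate_once(mask: np.ndarray) -> np.ndarray:
    """One-voxel dilation using face-connected neighbours (assumed structuring element)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = mask.copy()
    for axis in range(mask.ndim):
        for start in (0, 2):  # neighbour at index - 1, then at index + 1
            window = [slice(1, -1)] * mask.ndim
            window[axis] = slice(start, start + mask.shape[axis])
            out |= padded[tuple(window)]
    return out


def build_trimap(masks: np.ndarray, dilation_iters: int = 1) -> np.ndarray:
    """masks: boolean array of shape (4, D, H, W), one binary mask per radiologist."""
    masks = np.asarray(masks, dtype=bool)
    intersection = masks.all(axis=0)   # voxels every reader marked
    union = masks.any(axis=0)          # voxels at least one reader marked
    unknown = union & ~intersection    # region of reader disagreement
    for _ in range(dilation_iters):
        unknown = dilate_once(unknown)
    trimap = np.full(union.shape, BG, dtype=np.float32)
    trimap[intersection] = FG
    trimap[unknown] = UNKNOWN          # assumption: unknown band is assigned last
    return trimap
```

Setting `dilation_iters=0` gives the undilated variant, where the unknown region is exactly the set of voxels on which the readers disagree.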
3. Benchmark Splits, Evaluation Protocols, and Metrics
COVIDx CT-3
- Splits: Slices divided train/val/test ≈ 84%/8%/8%. Training leverages all labeling sources; validation and test are expert-labeled only.
| Split | Normal slices (patients) | CAP slices (patients) | COVID-19 slices (patients) | Total slices (patients) |
|---|---|---|---|---|
| Train | 35,996 (321) | 26,970 (592) | 300,733 (4,092) | 363,699 (5,005) |
| Validation | 17,570 (164) | 8,008 (202) | 8,147 (194) | 33,725 (560) |
| Test | 17,922 (164) | 7,965 (138) | 7,894 (201) | 33,781 (503) |
- Benchmarks: Several CNNs (SqueezeNet, MobileNetV2, EfficientNet-B0, NASNet-A-Mobile, COVID-Net CT L/S) are evaluated. EfficientNet-B0 achieves the top image-level sensitivity (99.1%) and overall accuracy (99.0%).
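- Metric Definitions (with $TP$, $TN$, $FP$, $FN$ the true-positive, true-negative, false-positive, and false-negative counts for the class of interest):
- Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
- Sensitivity: $\frac{TP}{TP + FN}$
- Specificity: $\frac{TN}{TN + FP}$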
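For a three-class problem such as normal/CAP/COVID-19, these metrics are commonly computed one class at a time in a one-vs-rest fashion. The sketch below shows that computation using the standard confusion-count definitions; the function name and the label strings are illustrative, not part of any benchmark code.

```python
from typing import Dict, Sequence


def one_vs_rest_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Guard against classes that are absent from the evaluated labels.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```

Reporting these per class, and per country or site where metadata allows, makes it easier to see whether a high overall accuracy hides weak performance on the smaller classes.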
CT-3M 3D Medical Matting Dataset
- Splits: 7:1:2 train/val/test by patient ID (Training: 605, Validation: 86, Test: 173 volumes).
- Metrics: Dice coefficient, Sum of Absolute Differences (SAD), Mean Squared Error (MSE), Intersection over Union (IoU), Gradient error, Connectivity error.
| Model | SAD×10⁻² | MSE×10⁻³ | Grad×10⁻² | Conn×10⁻² |
|---|---|---|---|---|
| CF+ | 152.62 | 0.43 | 14.96 | 132.39 |
| KNN+ | 102.22 | 0.25 | 13.78 | 76.01 |
| LB+ | 86.31 | 0.16 | 6.23 | 66.89 |
| IF+ | 79.71 | 0.18 | 7.88 | 69.09 |
| 3DMM | 99.42 | 0.24 | 6.37 | 69.25 |
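- Definitions (with $\alpha_i$ the predicted alpha value at voxel $i$, $\alpha_i^{*}$ the reference alpha value, and $N$ the number of voxels):
- Dice: $\frac{2\,|A \cap B|}{|A| + |B|}$ for binary masks $A$ and $B$
- SAD: $\sum_i |\alpha_i - \alpha_i^{*}|$
- MSE: $\frac{1}{N} \sum_i (\alpha_i - \alpha_i^{*})^2$
- Grad: $\sum_i (\nabla\alpha_i - \nabla\alpha_i^{*})^2$, where $\nabla\alpha_i$ denotes the gradient magnitude at voxel $i$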
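The matting metrics listed above can be sketched with NumPy as follows. This is an illustrative version under stated assumptions: gradients use plain finite differences (`np.gradient`) rather than any specific filter the benchmark may use, Dice is computed on masks binarized at 0.5, the connectivity error is omitted, and the scaling factors from the results table are not applied.

```python
import numpy as np


def gradient_magnitude(volume: np.ndarray) -> np.ndarray:
    """Finite-difference gradient magnitude of a 3D volume."""
    return np.sqrt(sum(g ** 2 for g in np.gradient(volume)))


def matting_errors(alpha_pred: np.ndarray, alpha_true: np.ndarray) -> dict:
    """SAD, MSE, gradient error, and thresholded Dice between two 3D alpha mattes."""
    pred = np.asarray(alpha_pred, dtype=np.float64)
    true = np.asarray(alpha_true, dtype=np.float64)
    diff = pred - true
    # Gradient error: squared difference of gradient magnitudes, summed over voxels.
    grad = ((gradient_magnitude(pred) - gradient_magnitude(true)) ** 2).sum()
    # Dice on binarized mattes (assumption: threshold 0.5).
    a, b = pred >= 0.5, true >= 0.5
    denom = a.sum() + b.sum()
    dice = 2.0 * (a & b).sum() / denom if denom else 1.0
    return {
        "sad": float(np.abs(diff).sum()),
        "mse": float(np.mean(diff ** 2)),
        "grad": float(grad),
        "dice": float(dice),
    }
```

Identical inputs give zero SAD, MSE, and gradient error and a Dice of 1, which is a quick sanity check before comparing against published numbers.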
4. Data Diversity, Bias, and Statistical Properties
COVIDx CT-3
- Class Imbalance: COVID-19 cases dominate; normal and CAP are underrepresented (by slice count, COVID-19 outnumbers normal by roughly 4:1 and CAP by roughly 7:1).
- Geography: 85.9% of patients are from four countries; substantial skew in population representation.
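- Patient Metadata: Age or sex is unknown for ~51% of patients. Among patients of known age, ages 30–89 predominate; across all patients, roughly 27% are recorded as male and 22% as female.
- Statistical Imbalance: Kullback–Leibler divergence is used to quantify class or geographical imbalance: $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$, where $P$ is the observed class or country distribution and $Q$ is a reference distribution (for example, uniform).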
- Bias Implications: Potential risk that models will learn scanner-, country-, or protocol-specific statistical artifacts rather than actual disease patterns.
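As a small worked example of the Kullback–Leibler divergence mentioned above, the sketch below compares the slice-level class shares reported earlier (16.6% normal, 10.0% CAP, 73.4% COVID-19) against a uniform reference. The choice of a uniform reference and of natural logarithms are assumptions for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) in nats; terms with P(i) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


class_share = [0.166, 0.100, 0.734]  # normal, CAP, COVID-19 (from the text)
uniform = [1 / 3] * 3                # assumed reference distribution
kl = kl_divergence(class_share, uniform)
```

A value of zero would indicate perfectly balanced classes, so larger values signal stronger imbalance. The same function can be applied to the per-country patient shares.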
CT-3M 3D Medical Matting Dataset
- Nodule Types: 450 solid, 214 part-solid, 200 ground-glass volumes.
- Shape/Alpha Stats: Lesion volumes: 50–5,000 voxels (mean 1,100 ± 850); sphericity 0.75 ± 0.10; surface area–to–volume 12.5 ± 3.0 mm⁻¹. Alpha occupancy is split across four alpha-value bins: 28%, 18%, 22%, and 32%.
- Demographics: Patient ages range up to 80 years, with approximate male/female parity.
5. Quality Control, Licensing, and Access
COVIDx CT-3
- Quality Control: Exclusion of severe artifact/incomplete volumes. Expert-only labeling for validation/testing.
- Availability: Open license (CC BY-4.0 or similar); downloadable at https://github.com/labsyspharm/COVIDx_CT-3.
- Best Practices: Users are urged to re-balance losses/sampling, use expert-labeled splits for validation, and report stratified performance to expose potential bias.
CT-3M 3D Medical Matting
- Quality Assurance: Automated checks (NaN, out-of-range, cropping), manual QC (visual edge/continuity inspection).
- Access: MIT-style open-source license, no registration required; hosted at https://github.com/wangsssky/3DMatting.
- Citation: Users are expected to cite the original publication when employing the dataset.
6. Recommended Processing and Usage Protocols
COVIDx CT-3
- Preprocessing: Intensity windowing [–1200, +600] HU, normalization, resizing.
- Model Training: Employ data augmentation (volume rotation, intensity scaling), stratify training/validation, apply class/sample re-weighting. Fine-tuning entails freezing early CNN layers to prevent overfitting to acquisition noise.
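One simple form of the class re-weighting mentioned above is inverse-frequency weighting, sketched here with the training-split slice counts from the split table. The specific formula (total divided by number of classes times class count) is a common heuristic chosen for illustration, not a scheme prescribed by the dataset authors.

```python
from typing import Dict

# Training-split slice counts taken from the split table.
TRAIN_SLICES = {"Normal": 35_996, "CAP": 26_970, "COVID-19": 300_733}


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


weights = inverse_frequency_weights(TRAIN_SLICES)
```

With these counts the COVID-19 class receives a weight below 1 while the normal and CAP classes receive weights above 1, so each class contributes equally to a weighted loss in aggregate.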
CT-3M 3D Medical Matting
- Preprocessing: Clip intensities, resample to 0.5 mm isotropic voxels, crop 128×128×N volumes.
- Augmentation: Flip, rotate (±15°), random spatial cropping per training run.
- Post-processing: Threshold alpha at 0.5 for binary extraction, followed by 3D median filtering and morphological closing.
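The intensity clipping and in-plane cropping steps recommended earlier ([–1000, 400] HU, 128×128×N patches) can be sketched as below. Several details are assumptions: the `(N, H, W)` axis order, normalization against the clip bounds rather than per-volume extrema, zero padding at the borders, and the function name `crop_nodule_patch`. Isotropic resampling is left out because it needs an interpolation library.

```python
from typing import Tuple

import numpy as np

HU_MIN, HU_MAX = -1000.0, 400.0  # clip range recommended in the text


def crop_nodule_patch(
    volume_hu: np.ndarray, center_yx: Tuple[int, int], size: int = 128
) -> np.ndarray:
    """Return an (N, size, size) patch in [0, 1] centred in-plane on center_yx.

    volume_hu is assumed to be ordered (N, H, W) and expressed in HU.
    """
    vol = np.clip(volume_hu.astype(np.float32), HU_MIN, HU_MAX)
    vol = (vol - HU_MIN) / (HU_MAX - HU_MIN)
    half = size // 2
    # Pad in-plane so crops near the border keep the full size;
    # the pad value 0 corresponds to HU_MIN after normalization.
    padded = np.pad(vol, ((0, 0), (half, half), (half, half)),
                    mode="constant", constant_values=0.0)
    cy, cx = center_yx
    # Index cy in the original volume sits at cy + half in the padded one,
    # so the window [cy, cy + size) is centred on the requested voxel.
    return padded[:, cy:cy + size, cx:cx + size]
```

A left-right flip for augmentation is then a one-liner, for example `np.flip(patch, axis=2)`, provided the same flip is applied to the alpha matte and masks.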
7. Context, Impact, and Applications
COVIDx CT-3 enables robust benchmarking of COVID-19 detection models on a large, multinational cohort, but distributional biases necessitate careful evaluation of generalization, especially in geographically or demographically underrepresented populations. Recommended mitigations include sample re-weighting and domain adversarial methods.
The CT-3M 3D Medical Matting Dataset introduces soft segmentation for ambiguous radiological targets, facilitating development of algorithms capable of modeling partial-volume effects and ambiguous lesion boundaries. It also provides a reference standard for comparative evaluation of 3D matting algorithms, with implications for improved malignancy risk stratification and treatment planning.
Both CT-3M datasets address key limitations in training data diversity, annotation quality, and benchmarking consistency, substantially advancing reproducibility and methodological rigor in CT-based AI research (Gunraj et al., 2022; Wang et al., 2022).