MMIST-ccRCC: A Real World Medical Dataset for the Development of Multi-Modal Systems (2405.01658v1)
Abstract: The acquisition of different data modalities can enhance our knowledge and understanding of various diseases, paving the way for more personalized healthcare. Thus, medicine is progressively moving towards the generation of massive amounts of multi-modal data (e.g., molecular, radiology, and histopathology). While this may seem like an ideal environment in which to capitalize on data-centric machine learning approaches, most methods still focus on a single modality or a pair of modalities, for a variety of reasons: i) lack of ready-to-use curated datasets; ii) difficulty in identifying the best multi-modal fusion strategy; and iii) missing modalities across patients. In this paper we introduce a real-world multi-modal dataset called MMIST-ccRCC that comprises 2 radiology modalities (CT and MRI), histopathology, genomics, and clinical data from 618 patients with clear cell renal cell carcinoma (ccRCC). We provide single- and multi-modal (early and late fusion) benchmarks for the task of 12-month survival prediction in the challenging scenario of one or more missing modalities for each patient, with missing rates that range from 26% for genomics data to more than 90% for MRI. We show that even with such severe missing rates, the fusion of modalities improves survival forecasting. Additionally, incorporating a strategy to generate the latent representations of the missing modalities from the available ones further improves performance, highlighting a potential complementarity across modalities. Our dataset and code are available here: https://multi-modal-ist.github.io/datasets/ccRCC
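The abstract does not spell out how early and late fusion are implemented, so the following is only a minimal sketch of the two general ideas when some modalities are missing, not the authors' method. The modality names, feature sizes, zero-filling with an availability mask, and probability averaging are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical modality names and feature sizes, chosen only for illustration.
MODALITY_DIMS = {"ct": 4, "mri": 4, "wsi": 4, "genomics": 3, "clinical": 2}


def early_fusion(features):
    """Concatenate per-modality feature vectors into one input vector.

    Missing modalities (absent key or None) are zero-filled, and a binary
    availability mask is appended so a downstream model can tell a true
    zero from a missing value. This is one common choice, assumed here.
    """
    parts, mask = [], []
    for name, dim in MODALITY_DIMS.items():
        vec = features.get(name)
        if vec is None:
            parts.append(np.zeros(dim))
            mask.append(0.0)
        else:
            parts.append(np.asarray(vec, dtype=float))
            mask.append(1.0)
    return np.concatenate(parts + [np.array(mask)])


def late_fusion(probabilities):
    """Average per-modality survival probabilities over available modalities."""
    available = [p for p in probabilities.values() if p is not None]
    if not available:
        raise ValueError("at least one modality must be available")
    return float(np.mean(available))


# A patient with CT, genomics, and clinical data but no MRI or histopathology.
patient_features = {
    "ct": np.ones(4),
    "genomics": np.ones(3),
    "clinical": np.ones(2),
}
fused_input = early_fusion(patient_features)

# Hypothetical outputs of independently trained single-modality classifiers.
patient_probs = {"ct": 0.8, "mri": None, "wsi": None, "genomics": 0.6, "clinical": 0.7}
fused_prob = late_fusion(patient_probs)
```

In this sketch, early fusion hands a single model one joint vector, whereas late fusion combines the decisions of separate single-modality models; the latter degrades gracefully when a modality such as MRI is absent for most patients, because missing predictions are simply left out of the average.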
Authors: Tiago Mota, M. Rita Verdelho, Alceu Bissoto, Carlos Santiago, Catarina Barata