
MMIST-ccRCC: A Real World Medical Dataset for the Development of Multi-Modal Systems (2405.01658v1)

Published 2 May 2024 in eess.IV and cs.CV

Abstract: The acquisition of different data modalities can enhance our knowledge and understanding of various diseases, paving the way for more personalized healthcare. Thus, medicine is progressively moving towards the generation of massive amounts of multi-modal data (e.g., molecular, radiology, and histopathology). While this may seem like an ideal environment in which to capitalize on data-centric machine learning approaches, most methods still focus on exploring a single modality or a pair of modalities, for a variety of reasons: i) lack of ready-to-use curated datasets; ii) difficulty in identifying the best multi-modal fusion strategy; and iii) missing modalities across patients. In this paper we introduce a real-world multi-modal dataset called MMIST-ccRCC that comprises 2 radiology modalities (CT and MRI), histopathology, genomics, and clinical data from 618 patients with clear cell renal cell carcinoma (ccRCC). We provide single and multi-modal (early and late fusion) benchmarks for the task of 12-month survival prediction in the challenging scenario of one or more missing modalities for each patient, with missing rates that range from 26% for genomics data to more than 90% for MRI. We show that even with such severe missing rates the fusion of modalities leads to improvements in survival forecasting. Additionally, incorporating a strategy to generate the latent representations of the missing modalities from the available ones further improves performance, highlighting a potential complementarity across modalities. Our dataset and code are available here: https://multi-modal-ist.github.io/datasets/ccRCC
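To make the distinction between early and late fusion under missing modalities concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names `late_fusion` and `early_fusion`, the use of NaN or `None` to mark a missing modality, the zero-filling with an availability mask, and the simple averaging rule are all assumptions chosen for illustration. The paper's benchmarks may use different fusion models.

```python
import numpy as np


def late_fusion(probs):
    """Average per-modality survival probabilities, skipping missing ones.

    probs: array-like of shape (n_patients, n_modalities). NaN marks a
    modality that is missing for that patient (an illustrative convention).
    Returns one fused probability per patient, or NaN if a patient has no
    available modality at all.
    """
    probs = np.asarray(probs, dtype=float)
    available = ~np.isnan(probs)
    counts = available.sum(axis=1)
    sums = np.where(available, probs, 0.0).sum(axis=1)
    fused = np.full(probs.shape[0], np.nan)
    # Divide only where at least one modality is available.
    np.divide(sums, counts, out=fused, where=counts > 0)
    return fused


def early_fusion(features, dims):
    """Concatenate per-modality feature vectors into one input vector.

    features: dict mapping modality name -> feature vector, or None if missing.
    dims: dict mapping modality name -> feature dimension (fixes the order).
    Missing modalities are zero-filled, and a binary availability mask is
    appended so a downstream model can tell zeros from real features.
    """
    parts, mask = [], []
    for name, dim in dims.items():
        vec = features.get(name)
        if vec is None:
            parts.append(np.zeros(dim))
            mask.append(0.0)
        else:
            parts.append(np.asarray(vec, dtype=float))
            mask.append(1.0)
    return np.concatenate(parts + [np.asarray(mask)])


if __name__ == "__main__":
    # Three hypothetical patients, three modalities (e.g. CT, MRI, genomics).
    per_modality = [[0.8, np.nan, 0.6],
                    [np.nan, np.nan, np.nan],
                    [0.2, 0.4, np.nan]]
    print(late_fusion(per_modality))
    print(early_fusion({"ct": [1.0, 2.0], "mri": None}, {"ct": 2, "mri": 3}))
```

In this sketch, late fusion combines the outputs of independently trained single-modality predictors, so a missing modality simply drops out of the average. Early fusion builds a single joint input, so missing modalities must be filled in somehow. Zero-filling is the simplest placeholder, whereas the abstract describes generating latent representations of the missing modalities from the available ones.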

Authors
  1. Tiago Mota
  2. M. Rita Verdelho
  3. Alceu Bissoto
  4. Carlos Santiago
  5. Catarina Barata