
Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets (2310.01438v2)

Published 30 Sep 2023 in cs.LG and cs.AI

Abstract: Advances in data acquisition, storage, and processing techniques have resulted in the rapid growth of heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for developing a holistic understanding of the disease and optimizing treatment. The need to integrate data from multiple sources is especially pronounced in complex diseases such as cancer, where it enables precision medicine and personalized treatments. This work proposes the Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS offers an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to give researchers greater analytical ability to uncover diagnostic and prognostic insights and enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. The cloud-native architecture of MINDS can handle exponential data growth in a secure, cost-optimized manner while ensuring substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms guarantee the pipelines' scalability and security. MINDS overcomes the limitations of existing biomedical data silos via an interoperable metadata-driven approach that represents a pivotal step toward the future of oncology data integration.
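
The abstract describes a patient-centric metadata layer used to build cohorts across data types. The sketch below illustrates that general idea only; it is not the MINDS implementation, which the abstract does not detail. The record layout, the `s3://example-bucket/...` URIs, and the function names `build_patient_index` and `build_cohort` are all hypothetical, chosen for illustration.

```python
from collections import defaultdict

# Hypothetical metadata records: (patient_id, modality, file_uri).
# Only metadata is stored; the files themselves stay at their source,
# which is one way to avoid replicating large imaging or genomic data.
RECORDS = [
    ("P001", "clinical", "s3://example-bucket/clinical/P001.json"),
    ("P001", "radiology", "s3://example-bucket/radiology/P001_ct.dcm"),
    ("P001", "pathology", "s3://example-bucket/pathology/P001_slide.svs"),
    ("P002", "clinical", "s3://example-bucket/clinical/P002.json"),
    ("P002", "radiology", "s3://example-bucket/radiology/P002_mr.dcm"),
    ("P003", "clinical", "s3://example-bucket/clinical/P003.json"),
]


def build_patient_index(records):
    """Group file URIs by patient, then by modality."""
    index = defaultdict(dict)
    for patient_id, modality, uri in records:
        index[patient_id].setdefault(modality, []).append(uri)
    return dict(index)


def build_cohort(index, required_modalities):
    """Return sorted IDs of patients that have every required modality."""
    required = set(required_modalities)
    return sorted(pid for pid, mods in index.items() if required.issubset(mods))


index = build_patient_index(RECORDS)
print(build_cohort(index, ["clinical", "radiology"]))               # ['P001', 'P002']
print(build_cohort(index, ["clinical", "radiology", "pathology"]))  # ['P001']
```

Keying every record on the patient identifier is what makes the cohort query a simple set-containment check, regardless of how many modalities are added later.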
