BenchMD: A Benchmark for Unified Learning on Medical Images and Sensors (2304.08486v2)

Published 17 Apr 2023 in cs.CV

Abstract: Medical data poses a daunting challenge for AI algorithms: it exists in many different modalities, experiences frequent distribution shifts, and suffers from a scarcity of examples and labels. Recent advances, including transformers and self-supervised learning, promise a more universal approach that can be applied flexibly across these diverse conditions. To measure and drive progress in this direction, we present BenchMD: a benchmark that tests how well unified, modality-agnostic methods, including architectures and training techniques (e.g., self-supervised learning, ImageNet pretraining), perform on a diverse array of clinically relevant medical tasks. BenchMD combines 19 publicly available datasets for 7 medical modalities, including 1D sensor data, 2D images, and 3D volumetric scans. Our benchmark reflects real-world data constraints by evaluating methods across a range of dataset sizes, including challenging few-shot settings that incentivize the use of pretraining. Finally, we evaluate performance on out-of-distribution data collected at different hospitals than the training data, representing naturally occurring distribution shifts that frequently degrade the performance of medical AI models. Our baseline results demonstrate that no unified learning technique achieves strong performance across all modalities, leaving ample room for improvement on the benchmark. Code is released at https://github.com/rajpurkarlab/BenchMD.
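The evaluation protocol described in the abstract (few-shot training subsets, then a comparison of in-distribution and out-of-distribution performance) can be illustrated with a minimal sketch. This is not the BenchMD code: the function names `few_shot_subset`, `accuracy`, and `distribution_shift_gap` are hypothetical, and plain accuracy stands in as a placeholder for whatever metric a given task actually uses.

```python
import random
from collections import defaultdict


def few_shot_subset(labels, k, seed=0):
    """Return sorted indices containing at most k examples per class.

    Hypothetical helper: a simple way to build a few-shot training
    subset from a list of class labels.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Classes with fewer than k examples contribute all they have.
        chosen.extend(rng.sample(indices, min(k, len(indices))))
    return sorted(chosen)


def accuracy(predictions, targets):
    """Fraction of predictions that match the targets (placeholder metric)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    if not targets:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def distribution_shift_gap(id_preds, id_targets, ood_preds, ood_targets):
    """In-distribution score minus out-of-distribution score.

    A positive gap means the model does worse on data from unseen sites.
    """
    return accuracy(id_preds, id_targets) - accuracy(ood_preds, ood_targets)
```

In this sketch, a method would be trained on the indices returned by `few_shot_subset` and then scored twice, once on held-out data from the training hospitals and once on data from a different hospital, with `distribution_shift_gap` summarizing the degradation.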

Authors (15)
  1. Kathryn Wantlin
  2. Chenwei Wu
  3. Shih-Cheng Huang
  4. Oishi Banerjee
  5. Farah Dadabhoy
  6. Veeral Vipin Mehta
  7. Ryan Wonhee Han
  8. Fang Cao
  9. Raja R. Narayan
  10. Errol Colak
  11. Adewole Adamson
  12. Laura Heacock
  13. Geoffrey H. Tison
  14. Alex Tamkin
  15. Pranav Rajpurkar