Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review (2403.02469v2)

Published 4 Mar 2024 in cs.CV and cs.LG

Abstract: Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze visual and textual medical data. Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on models designed for medical report generation and visual question answering (VQA). We provide background on NLP and CV, explaining how techniques from both fields are integrated into VLMs to enable learning from multimodal data. Key areas we address include the exploration of medical vision-language datasets, in-depth analyses of architectures and pre-training strategies employed in recent noteworthy medical VLMs, and a comprehensive discussion of evaluation metrics for assessing VLMs' performance in medical report generation and VQA. We also highlight current challenges and propose future directions, including enhancing clinical validity and addressing patient privacy concerns. Overall, our review summarizes recent progress in developing VLMs to harness multimodal medical data for improved healthcare applications.

Authors (2)
  1. Iryna Hartsock (8 papers)
  2. Ghulam Rasool (32 papers)
Citations (24)