Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
119 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports (2009.01523v1)

Published 3 Sep 2020 in cs.CV

Abstract: Joint image-text embedding extracted from medical images and associated contextual reports is the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, clinical report auto-generation. In this study, we adopt four pre-trained V+L models: LXMERT, VisualBERT, UNIER and PixelBERT to learn multimodal representation from MIMIC-CXR radiographs and associated reports. The extrinsic evaluation on OpenI dataset shows that in comparison to the pioneering CNN-RNN model, the joint embedding learned by pre-trained V+L models demonstrate performance improvement in the thoracic findings classification task. We conduct an ablation study to analyze the contribution of certain model components and validate the advantage of joint embedding over text-only embedding. We also visualize attention maps to illustrate the attention mechanism of V+L models.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Yikuan Li (23 papers)
  2. Hanyin Wang (13 papers)
  3. Yuan Luo (127 papers)
Citations (60)

Summary

We haven't generated a summary for this paper yet.