
Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training (2305.07920v3)

Published 13 May 2023 in cs.CV and cs.MM

Abstract: In recent years, the growing demand for medical imaging diagnosis has placed a significant burden on radiologists. As a solution, Medical Vision-Language Pre-training (Med-VLP) methods have been proposed to learn universal representations from medical images and reports, benefiting downstream tasks without requiring fine-grained annotations. However, existing methods have overlooked the importance of cross-modal alignment in joint image-text reconstruction, resulting in insufficient cross-modal interaction. To address this limitation, we propose a unified Med-VLP framework based on Multi-task Paired Masking with Alignment (MPMA), which integrates the cross-modal alignment task into the joint image-text reconstruction framework to achieve more comprehensive cross-modal interaction. A Global and Local Alignment (GLA) module is designed to assist the self-supervised paradigm in obtaining semantic representations with rich domain knowledge. Furthermore, we introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction and to fuse the multi-modal representations adequately. Experimental results demonstrate that the proposed unified approach outperforms previous methods on all downstream tasks, including uni-modal, cross-modal, and multi-modal tasks.
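
To make the multi-task setup concrete, below is a minimal sketch of how a combined objective of this kind is typically assembled: a masked image modeling (reconstruction) term, a masked language modeling term, and a global image-report alignment term. All function and argument names are illustrative assumptions; the paper does not publish this code, and the actual GLA module also performs local (token-patch) alignment while MA-CMF adds memory-augmented fusion, neither of which is shown here.

```python
# Hedged sketch of an MPMA-style combined pre-training loss in PyTorch.
# Assumes a ViT-style image decoder output and a BERT-style token head;
# these names and weightings are hypothetical, not from the paper's code.
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE, used here as a stand-in for global alignment."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mpma_style_loss(pixel_recon, pixel_target, patch_mask,
                    token_logits, token_target,
                    img_global, txt_global,
                    w_mim=1.0, w_mlm=1.0, w_align=1.0):
    # Masked image modeling: MSE over masked patches only.
    # pixel_recon/pixel_target: (B, N, D); patch_mask: (B, N) with 1 = masked.
    mim = (F.mse_loss(pixel_recon, pixel_target, reduction="none")
           .mean(dim=-1) * patch_mask).sum() / patch_mask.sum().clamp(min=1)
    # Masked language modeling: cross-entropy over masked tokens;
    # unmasked positions carry the ignore index -100 in token_target.
    mlm = F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                          token_target.view(-1), ignore_index=-100)
    # Global alignment between paired image and report embeddings.
    align = info_nce(img_global, txt_global)
    return w_mim * mim + w_mlm * mlm + w_align * align
```

The key design point the abstract emphasizes is that the alignment term is optimized jointly with both reconstruction terms on paired masked inputs, rather than as a separate pre-training stage, so the encoders receive cross-modal supervision throughout reconstruction.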

Authors (7)
  1. Ke Zhang (264 papers)
  2. Yan Yang (119 papers)
  3. Jun Yu (232 papers)
  4. Hanliang Jiang (2 papers)
  5. Jianping Fan (51 papers)
  6. Qingming Huang (168 papers)
  7. Weidong Han (8 papers)
Citations (16)