Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models (2506.08990v1)

Published 10 Jun 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.

Summary

The paper introduces ALTA, a method that adapts masked vision models to improve medical vision-language alignment.
It achieves efficiency by using only about 8% of the trainable parameters and less than 1/5 of the computational cost required for masked record modeling.
Experimental results show gains over the best-performing counterpart of more than 4 absolute percentage points in text-to-image accuracy and about 6 absolute percentage points in image-to-text retrieval accuracy.

Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models

Integrating vision and language modeling in medical imaging through cross-modal contrastive learning has improved image-text matching tasks such as retrieval and zero-shot classification. However, conventional CLIP-based methods are constrained by suboptimal visual representations, which in turn limits the quality of vision-language alignment. This paper presents ALTA (ALign Through Adapting), an efficient method for medical vision-language alignment that addresses this limitation by adapting a vision model pretrained via multimodal masked modeling.
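To make the cross-modal contrastive objective concrete, the following is a minimal sketch of a symmetric CLIP-style (InfoNCE) loss over a batch of paired image and text embeddings. It is not taken from the ALTA code base; the function names and the temperature value of 0.07 are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive.

    The temperature value is an illustrative assumption, not the paper's setting.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should put its mass on the diagonal.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    mismatched = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:   ", clip_style_loss(images, matched))
    print("mismatched pairs:", clip_style_loss(images, mismatched))
```

The loss is small when each image is most similar to its own report and large when the pairing is wrong, which is the property that retrieval and zero-shot classification rely on.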

ALTA uses only about 8% of the trainable parameters and less than 1/5 of the computational cost required for masked record modeling. The method builds on the strength of vision models pretrained through masked modeling, which excel at visual representation but struggle with direct cross-modal matching. By adapting such a pretrained vision model, ALTA achieves superior performance in vision-language matching tasks such as retrieval and zero-shot classification.
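The "about 8% of the trainable parameters" figure is a ratio between what is updated during adaptation and what is updated during masked record modeling. The sketch below shows how such a ratio is computed when a pretrained backbone is frozen and only a small set of added layers is trained. The group names and parameter counts are made-up round numbers for illustration, not the paper's architecture or measurements.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str          # hypothetical module name, for illustration only
    num_params: int    # number of scalar parameters in the group
    trainable: bool    # False means the group is frozen during this stage


def trainable_count(groups):
    """Total number of parameters that receive gradient updates."""
    return sum(g.num_params for g in groups if g.trainable)


def trainable_ratio(adapted, baseline):
    """Trainable parameters in the adaptation stage relative to the baseline stage."""
    return trainable_count(adapted) / trainable_count(baseline)


# Baseline stage: everything is trained (hypothetical size).
baseline = [ParamGroup("pretrained_backbone", 100_000_000, True)]

# Adaptation stage: backbone frozen, only small added layers are trained
# (hypothetical size chosen so the ratio is easy to read).
adapted = [
    ParamGroup("pretrained_backbone", 100_000_000, False),
    ParamGroup("added_adapter_layers", 8_000_000, True),
]

ratio = trainable_ratio(adapted, baseline)
print(f"trainable ratio: {ratio:.1%}")
```

Freezing most weights reduces optimizer state and gradient computation as well as the number of updated parameters, which is one reason adaptation stages of this kind tend to be cheaper than full pretraining.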

ALTA also integrates temporal-multiview radiograph inputs, which improves information consistency between radiographs and their corresponding descriptions in reports and further improves vision-language alignment. In the experimental evaluations, ALTA outperforms the best-performing counterpart by over 4 absolute percentage points in text-to-image accuracy and approximately 6 absolute percentage points in image-to-text retrieval accuracy.
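Retrieval accuracy of this kind is commonly reported as recall@k, and gains between methods as differences in absolute percentage points. The summary does not state which k or protocol the paper uses, so the sketch below is only an assumed, generic version: row i of the similarity matrix is query i, and the correct match is item i. The helper names are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match (same index) is among the top-k items.

    similarity[i][j] is the score between query i and candidate j.
    For the opposite retrieval direction, pass the transposed matrix.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def absolute_point_gain(ours, baseline):
    """Difference between two accuracies (given as fractions) in percentage points."""
    return 100.0 * (ours - baseline)


if __name__ == "__main__":
    # Toy image-to-text similarity matrix: 3 images (rows) x 3 reports (columns).
    sim = [
        [0.9, 0.1, 0.0],
        [0.2, 0.3, 0.8],
        [0.1, 0.2, 0.7],
    ]
    print("recall@1:", recall_at_k(sim, 1))
    print("recall@2:", recall_at_k(sim, 2))
```

An absolute gain differs from a relative one: moving from 46% to 50% accuracy is 4 absolute points, but roughly an 8.7% relative improvement.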

The implications of such advancements in medical vision-language alignment are substantial. Practically, this approach promises more efficient and accurate retrieval systems in medical databases, aiding clinicians in diagnostic processes. Theoretically, the findings suggest that focusing on adapting existing robust vision models rather than solely pursuing new architectures or methods could yield more efficient and scalable solutions for multimodal alignment tasks in medical imaging.

Moreover, the parameter-efficient training in ALTA can lower the computational barrier to adopting advanced AI models in clinical settings, providing a scalable path toward better clinical decision support systems. The paper also contributes a methodological proposition: it treats vision-language alignment as an independent and efficient stage, which may reduce the burden of pretraining and encourage further research on alignment strategies built on strong pretrained representations.

In summary, ALTA advances medical vision-language alignment and shows how parameter-efficient adaptation of pretrained models can improve performance in vision-language matching tasks such as retrieval and zero-shot classification. The results are likely to encourage further research on efficient methods for integrating vision and language models in medical applications.