- The paper introduces MVSL, a framework that decouples visual and textual adaptation through cross-paradigm fine-tuning to improve lesion-level discrimination.
- It applies multi-granularity contrastive learning by aligning global image-text features with patch-level semantics to boost few-shot and zero-shot performance.
- The framework integrates a disease semantic graph to regularize text embeddings, ensuring intra-class consistency and inter-class separation for reliable classification.
Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification
Motivation and Context
Low-resource biomedical image classification faces persistent challenges: scarcity of annotated data, subtle inter-class visual distinctions, and complex semantic relationships among diseases. Vision-language models (VLMs) such as CLIP, ALIGN, and their biomedical variants (BiomedCLIP, PubMedCLIP, PMC-CLIP) provide robust foundations through large-scale cross-modal pretraining, but they require parameter-efficient adaptation to downstream medical tasks. Traditional fine-tuning paradigms often apply a single strategy to both the visual and textual components, ignoring their distinct representational properties. The result is unstable image-text alignment, particularly under limited supervision.
Framework and Methodological Innovations
The MVSL framework addresses these limitations through three principal innovations:
1. Cross-Paradigm Fine-Tuning (CPFT):
MVSL decouples visual and textual adaptation. The visual branch inserts structured residual adapters into selected transformer blocks (primarily the higher layers) for fine-grained modulation of spatial features, enhancing lesion-level discrimination while keeping the backbone stable. The textual branch uses learnable prompts, enriched with LLM-generated descriptions, to inject task-specific semantics without disrupting the structure of the pretrained text encoder. This asymmetric approach respects modality differences and enables parameter-efficient optimization.
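As a rough illustration of the visual branch, a residual adapter can be sketched as a small bottleneck module whose output is added back to the frozen block's tokens. The summary does not give the adapter's exact structure, so everything here is an assumption chosen for illustration: the class name `ResidualAdapter`, the bottleneck width, the ReLU nonlinearity, the zero initialization of the up-projection, and the ViT-B/16-like token shape.

```python
import numpy as np


class ResidualAdapter:
    """Hypothetical bottleneck adapter: x + scale * up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Small random down-projection; zero up-projection so the adapter
        # is an exact identity before any training step (assumed init).
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.w_up = np.zeros((bottleneck, dim))
        self.scale = scale

    def __call__(self, tokens):
        # tokens: (num_tokens, dim), the output of a frozen transformer block
        hidden = np.maximum(tokens @ self.w_down, 0.0)  # ReLU
        return tokens + self.scale * (hidden @ self.w_up)

    def num_params(self):
        return self.w_down.size + self.w_up.size


# Assumed shapes: 196 patches + 1 class token, width 768.
adapter = ResidualAdapter(dim=768, bottleneck=64)
tokens = np.random.default_rng(1).normal(size=(197, 768))
out = adapter(tokens)
```

Under these assumptions only the two projection matrices are trainable, and with the zero-initialized up-projection the adapted block initially reproduces the pretrained features. That is one common way to maintain backbone stability at the start of fine-tuning.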
2. Multi-Granularity Contrastive Learning (MGCL):
Beyond global image-text alignment, MVSL introduces patch-level alignment, enforcing correspondence between local visual regions (e.g., lesion sites) and disease-specific text semantics. Both global and local contrastive losses are integrated, and predictions across granularities are dynamically fused through learnable coefficients, enhancing robustness and generalizability in few-shot and zero-shot settings.
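The interplay of global and patch-level alignment can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's formulation: the local score takes the maximum patch-text cosine similarity per class, both branches use a plain cross-entropy over class logits, and fusion uses a single sigmoid-gated scalar. The function names, temperature, and tensor shapes are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def cross_entropy(logits, labels):
    # Numerically stable log-softmax followed by the mean negative log-likelihood.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def multi_granularity_logits(global_feats, patch_feats, text_feats, temperature=0.07):
    g = l2_normalize(global_feats)  # (B, D) image-level features
    p = l2_normalize(patch_feats)   # (B, P, D) patch-level features
    t = l2_normalize(text_feats)    # (C, D) one text embedding per class
    global_logits = g @ t.T / temperature
    # Assumed aggregation: the best-matching patch represents each class.
    local_logits = (p @ t.T).max(axis=1) / temperature
    return global_logits, local_logits


def fuse(global_logits, local_logits, fusion_param):
    # Sigmoid keeps the learnable fusion coefficient in (0, 1).
    alpha = 1.0 / (1.0 + np.exp(-fusion_param))
    return alpha * global_logits + (1.0 - alpha) * local_logits


rng = np.random.default_rng(0)
B, P, D, C = 4, 49, 32, 3
labels = np.array([0, 1, 2, 1])
global_logits, local_logits = multi_granularity_logits(
    rng.normal(size=(B, D)), rng.normal(size=(B, P, D)), rng.normal(size=(C, D))
)
loss = cross_entropy(global_logits, labels) + cross_entropy(local_logits, labels)
fused = fuse(global_logits, local_logits, fusion_param=0.0)
```

With `fusion_param=0.0` the two granularities are weighted equally. In training, that scalar would be learned jointly with the adapters and prompts.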
3. Disease Semantic Graph (DSG):
MVSL constructs a class-level semantic topology by deriving text embeddings for each disease class from LLM-generated prompts, encoding inter-class semantic proximity via a soft adjacency matrix. DSG regularizes the textual feature space through Laplacian-based distillation, preserving intra-class consistency and inter-class separation. This structural guidance indirectly regularizes visual embeddings via cross-modal alignment.
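One way to picture the graph regularizer is a standard Laplacian smoothness penalty over class text embeddings. The summary does not specify the exact form of the "Laplacian-based distillation", so this sketch rests on assumptions: the adjacency is an exponentiated cosine similarity between reference (LLM-derived) class embeddings, and the penalty is tr(Zᵀ L Z) with the unnormalized Laplacian L = D - A. The function names and the temperature `tau` are hypothetical, and only the smoothness term is shown.

```python
import numpy as np


def soft_adjacency(ref_text_embeds, tau=0.5):
    # Cosine similarity between reference class embeddings, turned into
    # positive soft edge weights. Self-loops are removed.
    z = ref_text_embeds / np.linalg.norm(ref_text_embeds, axis=1, keepdims=True)
    adj = np.exp((z @ z.T) / tau)
    np.fill_diagonal(adj, 0.0)
    return adj


def laplacian_penalty(adj, embeds):
    # tr(Z^T L Z) with L = D - A; for symmetric A this equals
    # 0.5 * sum_ij A_ij * ||z_i - z_j||^2, so strongly linked classes
    # are penalized for drifting apart.
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.trace(embeds.T @ lap @ embeds)


rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 16))                   # reference class embeddings (assumed)
learned = ref + 0.1 * rng.normal(size=(5, 16))   # prompt-tuned class embeddings (assumed)
adj = soft_adjacency(ref)
penalty = laplacian_penalty(adj, learned)

# Equivalent pairwise form, useful as a sanity check.
diffs = learned[:, None, :] - learned[None, :, :]
pairwise = 0.5 * (adj * (diffs ** 2).sum(axis=-1)).sum()
```

Adding such a term to the training loss would keep the learned class embeddings consistent with the reference semantic topology; how MVSL balances it against the contrastive losses is not stated here.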
Strong Empirical Results
MVSL was validated across 11 public biomedical datasets, spanning 9 modalities and 10 anatomical regions. Key findings include:
- Few-shot classification: MVSL consistently outperformed state-of-the-art baselines (BiomedCoOp, CoOp, ProGrad) in accuracy across all low-shot regimes. Over BiomedCoOp, gains in average accuracy ranged from +1.38% (K=1) to +4.71% (K=16), indicating that the advantage grows as labeled data increases.
- Base-to-novel generalization: MVSL achieved the highest harmonic mean (HM) accuracy (77.22%), outperforming all compared approaches, reflecting balanced recognition of both seen and unseen classes.
- Ablations: The cross-paradigm configuration (text prompts plus image adapters, P_text + A_img) yielded the best performance. Inserting adapters only in the upper layers maximized gains, while excessive adapter stacking degraded generalization. MGCL and DSG proved complementary: together they improved both discriminability and structural robustness.
- Qualitative interpretability: Saliency maps indicated MVSL's attention was highly concentrated in clinically relevant regions. t-SNE visualizations showed compact and well-separated clusters, substantiating feature discrimination improvements.
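For reference, the harmonic mean (HM) used in base-to-novel evaluation combines accuracy on seen (base) and unseen (novel) classes so that a weak result on either side pulls the score down. The helper below is a small sketch; the function name is hypothetical and the input values are illustrative only, not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# Illustrative values only: a model strong on base classes but weaker on novel ones.
hm = harmonic_mean(80.0, 60.0)
print(round(hm, 2))  # 68.57
```

Because the harmonic mean sits below the arithmetic mean (70.0 in this example) whenever the two accuracies differ, a high HM indicates balanced recognition rather than strength on base classes alone.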
Theoretical Implications and Practical Applications
MVSL demonstrates that meaningful parameter-efficient adaptation requires respecting modality-specific encoder characteristics rather than applying adaptation schemes blindly across vision and language branches. Modeling both fine-grained (lesion-level) and coarse-grained (global image-level) correspondences, complemented by explicit semantic topology regularization, is crucial for robust performance in low-resource biomedical settings.
Practically, MVSL provides a scalable, interpretable, and reliable framework for real-world medical image analysis, where annotation scarcity is routine and clinical decision support requires trustworthy spatial interpretability. The modular design permits integration with evolving biomedical VLMs and LLMs, supporting rapid domain adaptation across heterogeneous medical imaging tasks.
Future Directions
MVSL's multi-view synergistic strategy opens avenues for:
Conclusion
MVSL achieves consistent improvements in low-resource biomedical image classification by combining modality-aware fine-tuning, multi-granularity contrastive alignment, and semantic-structural regularization. These synergistic components underpin MVSL's balanced adaptability, discriminability, and interpretability, marking its value as a robust foundation for clinical-grade multimodal AI systems in demanding data-scarce environments (2604.23977).