Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification

Published 27 Apr 2026 in cs.CV | (2604.23977v1)

Abstract: Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision-language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from LLMs, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on $11$ public biomedical datasets spanning $9$ imaging modalities and $10$ anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.

Authors (8)

Summary

The paper introduces MVSL, a framework that decouples visual and textual adaptation through cross-paradigm fine-tuning to improve lesion-level discrimination.
It applies multi-granularity contrastive learning by aligning global image-text features with patch-level semantics to boost few-shot and zero-shot performance.
The framework integrates a disease semantic graph to regularize text embeddings, ensuring intra-class consistency and inter-class separation for reliable classification.

Motivation and Context

Low-resource biomedical image classification faces persistent challenges: scarcity of annotated data, subtle inter-class visual distinctions, and complex semantic relationships among diseases. Vision-language models (VLMs) such as CLIP, ALIGN, and their biomedical variants (BiomedCLIP, PubMedCLIP, PMC-CLIP) provide robust foundations by leveraging large-scale cross-modal pretraining, but they require parameter-efficient adaptation to downstream medical tasks. Traditional fine-tuning paradigms often apply a single strategy to both the visual and textual components, ignoring their distinct representational properties and leading to unstable alignment, particularly under limited supervision.

Framework and Methodological Innovations

The MVSL framework addresses these limitations through three principal innovations:

1. Cross-Paradigm Fine-Tuning (CPFT):

MVSL decouples visual and textual adaptation. The visual branch uses structured residual adapters within selected transformer blocks (primarily higher layers) for fine-grained modulation of spatial features, enhancing lesion-level discrimination while keeping the backbone stable. The textual branch uses learnable prompts, enriched with LLM-generated descriptions, to inject task-specific semantics without disrupting the structure of the pretrained text encoder. This asymmetric approach respects modality differences and enables parameter-efficient optimization.
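The asymmetry described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the bottleneck shape, the ReLU nonlinearity, the zero initialization, and the names `residual_adapter` and `build_prompt` are all assumptions chosen for illustration. It shows a residual adapter acting on patch tokens from a frozen visual block, and learnable context vectors prepended to class token embeddings on the text side.

```python
import numpy as np

rng = np.random.default_rng(0)


def residual_adapter(h, w_down, w_up, scale=1.0):
    """Bottleneck adapter with a residual connection (hypothetical form)."""
    z = np.maximum(h @ w_down, 0.0)   # down-project, then ReLU
    return h + scale * (z @ w_up)     # up-project and add back to the input


def build_prompt(context, class_tokens):
    """Prepend shared learnable context vectors to one class's token embeddings."""
    return np.concatenate([context, class_tokens], axis=0)


dim, bottleneck, n_patches = 16, 4, 9

# Visual branch: patch tokens as they would leave a frozen transformer block.
h = rng.normal(size=(n_patches, dim))
w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
w_up = np.zeros((bottleneck, dim))    # zero init: the adapter starts as an identity map
out = residual_adapter(h, w_down, w_up)

# Text branch: 4 learnable context vectors plus 3 stand-in class-name embeddings.
context = rng.normal(scale=0.02, size=(4, dim))
class_tokens = rng.normal(size=(3, dim))
prompt = build_prompt(context, class_tokens)
```

In this sketch only `w_down`, `w_up`, and `context` would be trained; the backbone activations and the class token embeddings stay fixed, which is what keeps the adaptation parameter-efficient.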

2. Multi-Granularity Contrastive Learning (MGCL):

Beyond global image-text alignment, MVSL introduces patch-level alignment, enforcing correspondence between local visual regions (e.g., lesion sites) and disease-specific text semantics. Both global and local contrastive losses are integrated, and predictions across granularities are dynamically fused through learnable coefficients, enhancing robustness and generalizability in few-shot and zero-shot settings.
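A rough sketch of how global and patch-level alignment might be combined is given below. Several details are assumptions rather than facts from the paper: patch-to-text similarities are aggregated by max pooling over patches, the fusion coefficient is a single scalar passed through a sigmoid, the temperature is 0.07, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class."""
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()


def multi_granularity_logits(global_feat, patch_feats, text_feats, temperature=0.07):
    g = l2_normalize(global_feat)                  # (B, D) image-level features
    p = l2_normalize(patch_feats)                  # (B, P, D) patch-level features
    t = l2_normalize(text_feats)                   # (C, D) one embedding per class
    global_logits = g @ t.T / temperature          # (B, C)
    patch_sims = p @ t.T                           # (B, P, C)
    # Assumption: score each class by its best-matching patch.
    local_logits = patch_sims.max(axis=1) / temperature
    return global_logits, local_logits


def fuse(global_logits, local_logits, alpha_raw):
    """Mix the two granularities with a learnable scalar kept in (0, 1)."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_raw))
    return alpha * global_logits + (1.0 - alpha) * local_logits


rng = np.random.default_rng(0)
B, P, C, D = 4, 9, 3, 16
global_feat = rng.normal(size=(B, D))
patch_feats = rng.normal(size=(B, P, D))
text_feats = rng.normal(size=(C, D))
labels = np.array([0, 1, 2, 1])

g_logits, l_logits = multi_granularity_logits(global_feat, patch_feats, text_feats)
loss = cross_entropy(g_logits, labels) + cross_entropy(l_logits, labels)
fused = fuse(g_logits, l_logits, alpha_raw=0.0)    # alpha_raw = 0 gives an equal mix
```

Training on the sum of the two losses pushes both the whole-image feature and at least one local region toward the correct class text, while the fused logits are what would be used at prediction time.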

3. Disease Semantic Graph (DSG):

MVSL constructs a class-level semantic topology by deriving text embeddings for each disease class from LLM-generated prompts, encoding inter-class semantic proximity via a soft adjacency matrix. DSG regularizes the textual feature space through Laplacian-based distillation, preserving intra-class consistency and inter-class separation. This structural guidance indirectly regularizes visual embeddings via cross-modal alignment.
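One way to picture graph-based regularization of this kind is a Laplacian smoothness penalty, sketched below in NumPy. This is an illustrative stand-in, not the paper's exact distillation loss: the exponential similarity kernel, the temperature `tau`, and the function names are assumptions, and random vectors replace the LLM-derived class embeddings.

```python
import numpy as np


def soft_adjacency(ref_embeds, tau=0.5):
    """Symmetric soft adjacency from cosine similarity of reference class embeddings."""
    r = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    sim = r @ r.T
    adj = np.exp((sim - 1.0) / tau)   # values in (0, 1]; closer classes get larger weights
    np.fill_diagonal(adj, 0.0)        # no self-loops
    return adj


def laplacian_penalty(text_embeds, adj):
    """trace(T^T L T) with L = D - A; small when strongly linked classes have close embeddings."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return np.trace(text_embeds.T @ laplacian @ text_embeds)


rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 8))         # stand-in for LLM-derived class embeddings
learned = rng.normal(size=(5, 8))     # stand-in for prompt-tuned class embeddings

adj = soft_adjacency(ref)
penalty = laplacian_penalty(learned, adj)
```

For a symmetric adjacency matrix, this penalty equals half the adjacency-weighted sum of squared distances between class embeddings, so minimizing it pulls semantically related classes together in proportion to their edge weights while leaving weakly connected classes largely unconstrained.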

Strong Empirical Results

MVSL was validated across 11 public biomedical datasets, spanning 9 modalities and 10 anatomical regions. Key findings include:

Few-shot classification: MVSL consistently outperformed state-of-the-art baselines (BiomedCoOp, CoOp, ProGrad) in accuracy across all low-shot regimes. Average-accuracy gains over BiomedCoOp ranged from +1.38% ($K=1$) to +4.71% ($K=16$), indicating that MVSL's advantage grows as labeled data increases.
Base-to-novel generalization: MVSL achieved the highest harmonic mean (HM) accuracy (77.22%), outperforming all compared approaches, reflecting balanced recognition of both seen and unseen classes.
Ablations: The cross-paradigm configuration ($P_\text{text}+A_\text{img}$, i.e., prompts on the text branch and adapters on the image branch) yielded the best performance. Inserting adapters only in the upper layers maximized gains, while excessive adapter stacking degraded generalization. MGCL and DSG were complementary: combining them improved both discriminability and structural robustness.
Qualitative interpretability: Saliency maps indicated MVSL's attention was highly concentrated in clinically relevant regions. t-SNE visualizations showed compact and well-separated clusters, substantiating feature discrimination improvements.
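For context, the harmonic mean used in base-to-novel evaluations is conventionally defined as follows, where $\mathrm{Acc}_{\text{base}}$ and $\mathrm{Acc}_{\text{novel}}$ denote accuracy on seen and unseen classes. The symbol names are assumptions for illustration; this summary reports only the resulting HM, not the two component accuracies.

$$
\mathrm{HM} = \frac{2 \cdot \mathrm{Acc}_{\text{base}} \cdot \mathrm{Acc}_{\text{novel}}}{\mathrm{Acc}_{\text{base}} + \mathrm{Acc}_{\text{novel}}}
$$

Because the harmonic mean is pulled toward the smaller of its two inputs, a high HM cannot be reached by excelling on seen classes alone.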

Theoretical Implications and Practical Applications

MVSL demonstrates that meaningful parameter-efficient adaptation requires respecting modality-specific encoder characteristics rather than applying adaptation schemes blindly across vision and language branches. Modeling both fine-grained (lesion-level) and coarse-grained (global image-level) correspondences, complemented by explicit semantic topology regularization, is crucial for robust performance in low-resource biomedical settings.

Practically, MVSL provides a scalable, interpretable, and reliable framework deployable in real-world medical analysis, where annotation scarcity is routine and clinical decision support requires trustworthy spatial interpretability. The modular design permits integration with evolving biomedical VLMs and LLMs, supporting rapid domain adaptation and flexibility across heterogeneous medical imaging tasks.

Future Directions

MVSL's multi-view synergistic strategy opens avenues for:

Extending graph-based semantic regularization to hierarchical disease taxonomies or multi-label classification.
Integrating with continual learning and domain generalization frameworks for robust adaptation to evolving medical data distributions.
Exploring further interaction between visual and language prompts within the fusion mechanism, potentially leveraging joint prompt optimization.
Deploying MVSL in clinical workflows to facilitate interpretability-driven human-AI collaboration.

Conclusion

MVSL achieves consistent improvements in low-resource biomedical image classification by combining modality-aware fine-tuning, multi-granularity contrastive alignment, and semantic-structural regularization. These synergistic components underpin MVSL's balanced adaptability, discriminability, and interpretability, marking its value as a robust foundation for clinical-grade multimodal AI systems in demanding data-scarce environments (2604.23977).
