- The paper introduces BioHiCL, a framework aligning embedding similarity with MeSH-based label hierarchies to enhance retrieval precision.
- It employs regression alignment and hierarchy-aware contrastive loss alongside LoRA-based tuning for robust, efficient biomedical text representation.
- Empirical results demonstrate improved IR, sentence similarity, and QA performance with low latency and memory requirements.
Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH
Motivation and Background
Biomedical information retrieval (IR) presents unique challenges because it relies on highly specialized terminology and on complex, hierarchically structured semantic relationships between concepts. Most dense retrievers in the biomedical domain model semantic similarity through coarse binary relevance signals (a document is either “relevant” or not), ignoring the graded or partially overlapping meanings that are common in clinical and scientific texts.
Existing biomedical IR models frequently rely on domain-pretrained LLMs and contrastive learning, yet their supervision signals are too coarse to capture nuanced semantic overlap. MeSH (Medical Subject Headings) provides a curated, hierarchical ontology that encodes latent semantic relationships far beyond binary categorization. Leveraging the depth and structure of MeSH allows for more granular modeling of semantic similarity.
Figure 1: Example of sentence pairs labeled as neutral in MedNLI, but with MeSH annotations sharing a common parent in the disease hierarchy, exposing relatedness not captured by binary labels.
The BioHiCL Framework
Hierarchical Multi-Label Supervision
BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning) proposes aligning embedding similarity directly with MeSH-based label similarity. For any pair of biomedical abstracts, the degree of similarity is computed by weighting MeSH label matches according to their depth in the ontology hierarchy, favoring more specific, deeper nodes to emphasize semantically precise overlap. This multi-label, hierarchy-aware structure provides a fine-grained supervisory signal for embedding learning, in contrast to traditional contrastive approaches using binary relevance.
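To make the depth-weighted matching concrete, the sketch below scores two label sets with a weighted Jaccard ratio in which each label counts in proportion to its depth. The summary does not give the exact formula, so the ratio, the `depth` and `label_similarity` helpers, and the tree-number-style identifiers are all assumptions chosen for illustration. The label sets are taken to already include ancestor nodes.

```python
def depth(tree_number: str) -> int:
    """Depth of a node, counted as the number of dot-separated segments."""
    return tree_number.count(".") + 1


def label_similarity(labels_a: set, labels_b: set) -> float:
    """Depth-weighted overlap between two label sets, in [0, 1].

    Assumption: a weighted Jaccard ratio with weight = depth. The
    formula used in the paper may differ.
    """
    union = labels_a | labels_b
    if not union:
        return 0.0
    shared_weight = sum(depth(t) for t in labels_a & labels_b)
    total_weight = sum(depth(t) for t in union)
    return shared_weight / total_weight


# Illustrative identifiers in a tree-number style, not verified MeSH entries.
abstract_a = {"C19", "C19.246", "C19.246.300"}
abstract_b = {"C19", "C19.246", "C19.246.267"}  # same parent, different leaf
abstract_c = {"C04", "C04.588"}                 # unrelated branch

sim_ab = label_similarity(abstract_a, abstract_b)
sim_ac = label_similarity(abstract_a, abstract_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Under this weighting, two abstracts that share only shallow ancestors receive a modest score, while a shared deep node would raise the score more than a shared root category.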
Modeling and Objectives
Each biomedical sentence or abstract is mapped through a dense encoder (based on BGE) to an embedding space. MeSH annotations, supplemented by their ancestors in the hierarchy, are encoded as multi-hot vectors with depth-based weighting, resulting in a specificity-sensitive, high-dimensional label representation.
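The label encoding described above can be sketched as follows, assuming each annotation is given as a dot-separated tree number so that ancestors are simply its prefixes. The helper names `with_ancestors` and `encode_labels`, the choice of depth itself as the weight, and the example identifiers are hypothetical. A real MeSH descriptor can sit at several tree positions, which this sketch ignores.

```python
def with_ancestors(tree_numbers):
    """Expand each tree number with all of its prefix ancestors."""
    expanded = set()
    for number in tree_numbers:
        parts = number.split(".")
        for end in range(1, len(parts) + 1):
            expanded.add(".".join(parts[:end]))
    return expanded


def encode_labels(tree_numbers, vocabulary):
    """Depth-weighted multi-hot vector over a fixed label vocabulary.

    Assumption: the weight of an active label equals its depth, so
    deeper (more specific) nodes contribute larger entries.
    """
    active = with_ancestors(tree_numbers)
    return [
        float(label.count(".") + 1) if label in active else 0.0
        for label in vocabulary
    ]


# Illustrative vocabulary; identifiers are not verified MeSH entries.
vocabulary = ["C04", "C19", "C19.246", "C19.246.267", "C19.246.300"]
vector = encode_labels(["C19.246.300"], vocabulary)
print(vector)
```

A single deep annotation therefore activates its whole ancestor chain, which is what lets two documents with sibling annotations overlap in label space.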
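Two main objectives are employed:
- Regression alignment: the similarity between two embeddings is regressed toward the MeSH-based label similarity of the corresponding pair.
- Hierarchy-aware contrastive loss: a contrastive objective that draws its supervision from the hierarchical label structure rather than from binary relevance.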
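The two objectives named in this summary, regression alignment and a hierarchy-aware contrastive loss, are not spelled out in detail here. The sketch below shows one plausible reading: a mean squared error between embedding cosine similarity and label similarity, and a contrastive term whose targets are soft, proportional to label similarity. Both function bodies, the temperature value, and the toy batch are assumptions, not the paper's definitions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def regression_alignment_loss(embs, label_sim):
    """MSE between cosine similarity and label similarity over pairs i < j."""
    n = len(embs)
    errors = [
        (cosine(embs[i], embs[j]) - label_sim[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(errors) / len(errors)


def soft_contrastive_loss(embs, label_sim, temperature=0.1):
    """Cross-entropy between softmax(cosine / T) and label-derived targets.

    Assumption: for each anchor, the target distribution over the other
    batch items is proportional to their label similarity to the anchor.
    """
    n = len(embs)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        mass = sum(label_sim[i][j] for j in others)
        if mass == 0:
            continue  # anchor has no related item in this batch
        logits = [cosine(embs[i], embs[j]) / temperature for j in others]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(-sum(
            (label_sim[i][j] / mass) * (z - log_norm)
            for j, z in zip(others, logits)
        ))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: items 0 and 1 share labels, item 2 is unrelated.
label_sim = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

for name, embs in [("aligned", aligned), ("misaligned", misaligned)]:
    print(name,
          regression_alignment_loss(embs, label_sim),
          soft_contrastive_loss(embs, label_sim))
```

In this toy batch both losses are lower for the embeddings whose geometry matches the label similarities, which is the behavior the two objectives are meant to encourage.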
Efficient Adaptation with LoRA
BioHiCL adapts a general-domain retriever to the biomedical domain via LoRA-based parameter-efficient fine-tuning, injecting low-rank adapters into the backbone weights while freezing the vast majority of parameters. This yields rapid adaptation without the memory footprint or compute requirements of full fine-tuning.
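As a rough illustration of why low-rank adapters are cheap, the sketch below implements the standard LoRA update for one linear layer, h = Wx + (alpha / r) B(Ax), with W frozen and only A and B trainable, and compares parameter counts. The 768-by-768 layer size, the rank of 8, and the helper names are illustrative assumptions. The summary does not state the configuration BioHiCL uses.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W is frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, rank):
    """A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


# Tiny example: 2x2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]        # rank x d_in
B_zero = [[0.0], [0.0]]  # d_out x rank, zero at initialization
B_tuned = [[1.0], [0.0]]
x = [2.0, 4.0]

h_init = lora_forward(W, A, B_zero, x, alpha=2.0, rank=1)
h_tuned = lora_forward(W, A, B_tuned, x, alpha=2.0, rank=1)

# Hypothetical 768 x 768 projection with rank 8.
full_params = 768 * 768
adapter_params = lora_trainable_params(768, 768, 8)
print(h_init, h_tuned, adapter_params / full_params)
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, so training starts from the general-domain retriever's behavior and only the small A and B matrices receive gradients.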
Empirical Results
BioHiCL was evaluated on multiple biomedical benchmarks, including information retrieval (NFCorpus, TREC-COVID, SciFact, SCIDOCS), sentence similarity (BIOSSES, SciFact sentences), and question answering (PubMedQA). Both the BioHiCL-Base (0.1B) and BioHiCL-Large (0.3B) variants were extensively benchmarked against leading general and biomedical-domain dense retrievers, covering a range of parameter sizes and training paradigms.
Key findings:
- Retrieval Effectiveness: BioHiCL-Base achieves the highest IR average (0.543) despite its compact size, outperforming larger models such as BMRetriever-1B. BioHiCL-Large further improves or matches performance on select benchmarks.
- Sentence Similarity and QA: BioHiCL-Base achieves the highest BIOSSES Spearman correlation (0.896), while BioHiCL-Large reaches the best Recall@1 on PubMedQA (0.898). Both models are robust across tasks without relying on task-specific prompts.
- Computational Efficiency: Both variants exhibit low latency and modest memory consumption (e.g., 3.5 ms/doc for corpus encoding with BioHiCL-Base and 0.63 ms/query for query encoding), making them suitable for real-time, large-scale deployments on standard hardware.
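For readers unfamiliar with the two metrics quoted above, the sketch below computes Spearman correlation and a rank-1 hit rate on toy data. It assumes no tied scores for Spearman and treats Recall@1 as the fraction of queries whose top-ranked document is relevant, which matches Recall@1 when each query has a single relevant document. The data and helper names are invented for illustration and are unrelated to the reported results.

```python
def ranks(values):
    """1-based ranks of the values, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(gold, predicted):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(gold)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(gold), ranks(predicted)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def recall_at_1(scores, relevant):
    """Fraction of queries whose highest-scoring document is relevant."""
    hits = 0
    for query_id, doc_scores in scores.items():
        top_doc = max(doc_scores, key=doc_scores.get)
        hits += top_doc in relevant[query_id]
    return hits / len(scores)


# Toy sentence-similarity data: human ratings vs. model cosine scores.
gold = [4.0, 1.0, 3.0, 2.0]
predicted = [0.9, 0.2, 0.5, 0.6]
rho = spearman(gold, predicted)

# Toy retrieval data: per-query document scores and relevant sets.
scores = {"q1": {"d1": 0.8, "d2": 0.3}, "q2": {"d1": 0.4, "d2": 0.1}}
relevant = {"q1": {"d1"}, "q2": {"d2"}}
r_at_1 = recall_at_1(scores, relevant)
print(rho, r_at_1)
```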
Ablation studies confirm that all four architectural and training design choices (inclusion of ancestor labels, depth-based weighting, regression alignment, and contrastive loss) are essential for peak performance. In particular, omitting the contrastive objective or hierarchical label expansion consistently degrades retrieval quality.
Implications and Future Directions
The use of hierarchical, multi-label supervision based on MeSH provides a substantially richer training signal for biomedical dense retrieval. This moves beyond binary or instance-level contrastive objectives toward models with graded, specificity-aware semantic matching. As a result, BioHiCL’s learned embeddings capture nuanced partial overlaps and hierarchical semantic ties critical for biomedical knowledge work.
Practically, the demonstrated efficiency and effectiveness of BioHiCL suggest the viability of wide deployment in clinical and scientific information systems, with resource demands compatible with existing GPU infrastructure. The superior performance of BioHiCL at small and medium model scales also underscores the utility of parameter-efficient fine-tuning for domain adaptation.
In principle, the framework generalizes to other domains where expert-curated, hierarchical multi-label ontologies exist, such as e-commerce taxonomies or Wikipedia categories. Extending this approach could drive advances in dense retrieval across diverse semi-structured, hierarchically labeled corpora.
Limitations
BioHiCL depends on high-quality, domain-wide hierarchical annotation resources. Domains that lack such curated ontologies, or whose hierarchy does not reflect semantic specificity well, may not benefit from this method. Additionally, the fixed depth-based weighting may not always correspond to task- or context-specific relevance.
Conclusion
BioHiCL demonstrates that incorporating expert-curated hierarchical multi-label structures into contrastive learning yields dense biomedical retrievers that are both efficient and highly effective. By explicitly aligning embedding geometry with the graded, hierarchy-aware semantics of MeSH, BioHiCL substantially improves representational fidelity for biomedical text, supporting more accurate and nuanced information access. Extensions to other hierarchically annotated domains may facilitate similar improvements in retrieval and representation learning.