
SL-HyDE: Self-Learning for Zero-Shot Retrieval

Updated 29 March 2026
  • Self-Learning HyDE (SL-HyDE) is an iterative framework that leverages LLMs to generate pseudo-documents for zero-shot dense medical retrieval.
  • It employs a dual self-learning loop that fine-tunes both a generator and a dense retriever to capture domain semantics, without the need for curated query–document pairs.
  • Empirical results on the Chinese Medical Information Retrieval Benchmark show significant nDCG@10 gains, demonstrating its efficacy in specialized medical retrieval.

Self-Learning Hypothetical Document Embeddings (SL-HyDE) is an end-to-end framework for zero-shot dense retrieval in specialized domains, notably Chinese medical information retrieval, that requires no relevance-labeled training data. SL-HyDE iteratively bootstraps between an LLM generator and a dense retriever, leveraging large unlabeled medical corpora. Through a mutual self-learning loop, it injects domain knowledge into both components, advancing zero-shot dense retrieval where curated query–document pairs are limited or absent (Li et al., 2024).

1. Motivation and Conceptual Foundations

Traditional dense retrievers rely on abundant relevance-labeled query–document pairs for optimal performance. In specialized settings such as medical information retrieval (MIR), especially for languages other than English, such annotated resources are rare or prohibitively expensive to obtain. Hypothetical Document Embeddings (HyDE) sidesteps the need for labels by prompting an LLM to generate a pseudo-document from a query, then searching for real documents whose embeddings are closest to that of the pseudo-document. However, off-the-shelf LLMs often lack domain specificity, and general-purpose retrievers encode medical terminology inadequately. SL-HyDE addresses these deficiencies through a dual self-learning approach:

  • The generator is self-supervised to produce hypothetical documents that maximize retrieval of ground-truth medical passages.
  • The retriever is self-supervised on triplets of (query, hypothetical document, true document), adapting embedding spaces to reflect domain terminology and semantics.
  • The mutual bootstrapping leverages an unlabeled medical corpus (e.g., Huatuo26M_encyclopedia), simulating relevance signals and achieving robust zero-shot adaptation for dense retrieval in the medical domain (Li et al., 2024).

2. Generator and Retriever Architectures

The generator, denoted $G$, is an instruction-tuned LLM (examples: Qwen2.5-32B-Instruct, ChatGLM3-6B, Llama2-7B-Chat) that maps a query $q$, such as a medical question or title, into a hypothetical document $\hat d$:

$$\hat d = G(q; \text{Prompt}).$$

Prompt templates include Q2P (“Please generate a medical content paragraph to answer this question. Question: [QUESTION] Paragraph:”), T2P (based on title), and P2P (for similar-text tasks). Deterministic generation ($\tau_{\rm gen} = 0$) ensures reproducibility during self-learning.

The retriever, $R$, is a dense dual-encoder (e.g., based on BGE-Large-zh) that encodes both queries and documents as vectors and uses either inner product or cosine similarity for retrieval.

The generator is optimized by minimizing next-token cross-entropy over pseudo-labeled pairs $(q, d^*)$:

$$\mathcal{L}_G = -\sum_{(q,\hat d)\in D_{LLM}} \sum_{t=1}^{T} \log P_G\left(\hat d^{(t)} \mid q, \hat d^{(<t)}\right).$$
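
As a rough illustration of the generator objective, the sketch below sums the negative log-probabilities of the target tokens over a toy dataset. It assumes the per-token probabilities $P_G(\hat d^{(t)} \mid q, \hat d^{(<t)})$ have already been obtained from the model; the function names `sequence_nll` and `generator_loss` and the toy numbers are hypothetical and do not come from the paper.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one target pseudo-document.

    token_probs[t] is the probability the generator assigns to the t-th
    target token given the query and the preceding target tokens.
    """
    return -sum(math.log(p) for p in token_probs)

def generator_loss(dataset):
    """Sum the per-sequence NLL over all pseudo-labeled pairs."""
    return sum(sequence_nll(probs) for probs in dataset)

# Toy data (made up): two pseudo-documents with 3 and 2 target tokens.
toy_dataset = [[0.5, 0.25, 0.5], [0.5, 0.5]]
loss = generator_loss(toy_dataset)
```

In practice a deep-learning framework would compute this from logits in a numerically stable way; the plain-probability form is used here only to mirror the equation.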

The retriever is fine-tuned using a contrastive loss over (query, document, negative) triplets, incorporating both in-batch and approximate nearest neighbor (ANN) hard negatives:

$$\mathcal{L}_R = - \sum_{(q,d)\in D_{emb}} \log \frac{ \exp( s(q,d)/\tau ) }{ \exp( s(q,d)/\tau ) + \sum_{d^- \in B \cup D^-} \exp ( s(q, d^-)/\tau ) }.$$
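
The contrastive term for a single (query, positive) pair can be sketched as follows. The sketch assumes that $s$ is an inner product, that $\tau$ is a temperature, and that the negatives passed in stand for the union of in-batch and ANN-mined hard negatives; the name `contrastive_loss` and the toy vectors are illustrative only.

```python
import math

def dot(u, v):
    """Inner-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(query_vec, pos_vec, neg_vecs, tau):
    """-log of the softmax weight of the positive among positive + negatives."""
    logits = [dot(query_vec, pos_vec) / tau]
    logits += [dot(query_vec, n) / tau for n in neg_vecs]
    # log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy vectors (made up); tau = 1.0 is an arbitrary value for illustration.
q = [1.0, 0.0]
d_pos = [1.0, 0.0]
d_negs = [[0.0, 1.0]]
pair_loss = contrastive_loss(q, d_pos, d_negs, tau=1.0)
```

The full loss would sum this quantity over all pairs in $D_{emb}$. Lowering `tau` sharpens the softmax, so hard negatives that score close to the positive are penalized more strongly.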

3. Self-Learning Loop: Workflow and Mathematical Formulation

The SL-HyDE self-learning process proceeds in the following stages:

  1. Initialization: The retriever $R$ is initialized from a general-purpose dense model; the generator $G$ starts as an off-the-shelf LLM.
  2. Query Construction: Using the unlabeled corpus $D$, synthetic queries are generated by prompting the LLM to create plausible medical questions from raw text.
  3. Hypothetical Document Generation: For each synthetic query $q$, the generator produces $K$ candidate pseudo-documents.
  4. Retrieval-Guided Selection: Each candidate $d'_i$ is treated as a query by $R$, which ranks the corpus $D$ to locate the passage $d$ that originally seeded the query. The candidate $d^*$ with the best rank (lowest $r_i$) is selected.
  5. Generator Fine-Tuning: The $(q, d^*)$ pairs are used to update $G$ via cross-entropy loss.
  6. Retriever Fine-Tuning: The fine-tuned $G$ generates hypothetical documents for all queries; hard negatives are mined via ANN; the retriever is updated via contrastive loss on triplets $(q, d', d)$.
  7. Iteration: The process may be repeated, using the updated $R$ to select higher-quality pseudo-labels. Empirically, one iteration yields substantial improvements.
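
The retrieval-guided selection in step 4 can be sketched as below. This is a minimal illustration that assumes candidate and corpus embeddings are already computed and compared by inner product; `rank_of_seed`, `select_pseudo_document`, and the toy vectors are hypothetical names and data, not the authors' implementation.

```python
def dot(u, v):
    """Inner-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_of_seed(candidate_vec, corpus_vecs, seed_index):
    """1-based rank of the seed passage when the candidate acts as the query."""
    scores = [dot(candidate_vec, d) for d in corpus_vecs]
    seed_score = scores[seed_index]
    # Rank = 1 + number of passages scoring strictly higher than the seed.
    return 1 + sum(1 for s in scores if s > seed_score)

def select_pseudo_document(candidate_vecs, corpus_vecs, seed_index):
    """Return (candidate index, rank) for the candidate that ranks the seed best."""
    ranks = [rank_of_seed(c, corpus_vecs, seed_index) for c in candidate_vecs]
    best = min(range(len(ranks)), key=ranks.__getitem__)
    return best, ranks[best]

# Toy embeddings (made up): three corpus passages, the seed is passage 1.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
candidates = [[1.0, 0.1], [0.2, 1.0], [0.6, 0.6]]
best_index, best_rank = select_pseudo_document(candidates, corpus, seed_index=1)
```

The selected candidate would then be paired with its query for generator fine-tuning. On a large corpus the exhaustive scoring here would normally be replaced by an approximate nearest neighbor index.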

Vector aggregation for inference uses both the original query and generated pseudo-documents:

$$\bar v_q = \frac{1}{N+1} \left( \mathcal{M}_r(q) + \sum_{k=1}^N \mathcal{M}_r(d'_k) \right),$$

and documents $d \in D$ are ranked according to $\langle \bar v_q, \mathcal{M}_r(d) \rangle$ (Li et al., 2024).
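
A small sketch of this inference-time aggregation, assuming the encoder outputs $\mathcal{M}_r(\cdot)$ are available as plain lists of floats; the function names and toy vectors are illustrative assumptions.

```python
def dot(u, v):
    """Inner product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def aggregate_query(query_vec, pseudo_doc_vecs):
    """Mean of the query embedding and the N pseudo-document embeddings."""
    vectors = [query_vec] + list(pseudo_doc_vecs)
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rank_documents(agg_vec, doc_vecs):
    """Document indices sorted by descending inner product with agg_vec."""
    scores = [dot(agg_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy embeddings (made up): one query, N = 2 pseudo-documents, three documents.
q_vec = [1.0, 0.0]
pseudo_vecs = [[0.0, 1.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v_bar = aggregate_query(q_vec, pseudo_vecs)
order = rank_documents(v_bar, docs)
```

Because the query and each pseudo-document carry equal weight, the influence of the original query shrinks as $N$ grows.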

4. Evaluation: Chinese Medical Information Retrieval Benchmark (CMIRB)

SL-HyDE is evaluated under realistic conditions on the Chinese Medical Information Retrieval Benchmark (CMIRB), which comprises five task categories:

| Task Category | Example Datasets | Description |
| --- | --- | --- |
| Medical Knowledge Retrieval | MedExam, DuBaike, DXYDisease | Exam QA, encyclopedia QA, disease definitions |
| Medical Consultation Retrieval | MedicalRetrieval, CmedqaRetrieval, DXYConsult | Online and forum-based medical Q&A |
| Medical News Retrieval | CovidRetrieval | COVID-19 news articles |
| Medical Post Retrieval | IIYiPost | Forum post title-to-content retrieval |
| Medical Literature Retrieval | CSLCite, CSLRel | Paper title-to-abstract retrieval, similar paper search |

The datasets vary in query/document scale (hundreds to tens of thousands), query lengths (7.6–281.8 words), and cover diverse real-world, high-value retrieval scenarios. Performance is assessed via nDCG@10 and Recall@100 (Li et al., 2024).
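
For reference, the two metrics can be computed as in the sketch below. It uses the linear-gain form of nDCG with a log2 discount; the source does not say which gain formulation the benchmark uses, so that choice, the function names, and the toy ranking are all assumptions.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k; relevance maps document id -> graded relevance (missing = 0)."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def recall_at_k(ranked_ids, relevance, k=100):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Toy ranking (made up): the single relevant document is returned second.
ranking = ["d3", "d1", "d2"]
qrels = {"d1": 1}
ndcg10 = ndcg_at_k(ranking, qrels, k=10)
recall100 = recall_at_k(ranking, qrels, k=100)
```

Benchmark scores are typically the mean of these per-query values over all queries in a dataset.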

5. Experimental Performance and Analysis

SL-HyDE consistently improves retrieval metrics compared to both base retrievers and vanilla HyDE:

  • Overall improvement: In the Qwen2 + BGE-Large-zh setup, SL-HyDE yields an average nDCG@10 gain of +7.2% over BGE and +4.9% over HyDE across all ten CMIRB datasets. Notable task-level results: +11.0% (MedExam), +15.6% (DuBaike), +6.4% (DXYConsult).
  • Generator generalization: Using ChatGLM3-6B as generator provides a +4.65% nDCG@10 gain over HyDE; Llama2-7B yields +8.23%, underscoring adaptability to different LLM backbones even with domain mismatch.
  • Retriever generalization: Fine-tuning the PEG retriever with SL-HyDE raises average nDCG@10 from 57.46% to 60.97% (+5.48%). An mE5 retriever shows +3.90% improvement.
  • Ablation and fusion: Combining query and generated documents via mean pooling is superior to using only the pseudo-document or concatenation. Ablating retriever fine-tuning drops nDCG@10 by 2.27 points; ablating generator fine-tuning drops performance by 0.61 points, indicating that both components contribute, with the retriever update accounting for the larger share. Generating multiple hypothetical documents ($K=5$) provides modest further gains at increased inference cost (Li et al., 2024).

6. Practical Considerations, Limitations, and Implications

Key practical aspects include:

  • Computational cost: Fine-tuning the LLM and the retriever requires multiple A100 GPUs but is manageable relative to full-scale pretraining for domain adaptation. Inference cost scales with the number of generated pseudo-documents per query.
  • LLM scale and prior knowledge: Large, instruction-tuned LLMs yield higher-quality pseudo-documents. Smaller or out-of-domain LLMs (English-centric models) benefit more from self-learning but may need further adaptation.
  • Dependence on corpus coverage: A large, diverse unlabeled corpus (10,000–100,000 passages) is necessary. Limited representation of rare or highly specialized entities restricts effectiveness.
  • Hallucination and filtering: LLM-generated content is susceptible to hallucinations. The retrieval-guided pseudo-label selection step mitigates—though does not fully eliminate—spurious content.
  • Loop iteration: Substantial improvements can be achieved from one self-learning iteration, with diminishing returns from further cycles.

A plausible implication is that the SL-HyDE paradigm could generalize to other specialized domains and languages given sufficient unlabeled data and domain-adapted LLM initialization (Li et al., 2024).

7. Summary and Significance

SL-HyDE achieves practical zero-shot domain adaptation for dense retrieval using self-learning between an LLM-based pseudo-document generator and a dense retriever, eliminating costly manual annotation. Through iterative refinement and self-supervised fine-tuning on synthetic triplets, it enables high-precision MIR in settings such as Chinese medical literature. Evaluation on CMIRB shows consistent gains over both base retrievers and vanilla HyDE, and the approach extends across LLM and retriever architectures. SL-HyDE thus establishes a new state-of-the-art approach for domain-adaptive retrieval in low-resource, high-specialization knowledge contexts (Li et al., 2024).
