W-RAG Framework: Weakly Supervised Dense Retrieval
- W-RAG is a weakly supervised dense retrieval framework that leverages LLM-generated log-likelihood signals to identify passages that directly boost answer accuracy in RAG systems.
- The framework employs a three-stage pipeline: initial BM25 retrieval, LLM-based reranking for weak label generation, and fine-tuning of neural retrievers like DPR and ColBERT.
- Empirical results on datasets such as MSMARCO QnA demonstrate that W-RAG significantly improves recall and answer generation metrics, narrowing the gap with fully supervised methods.
The W-RAG (Weakly Supervised Dense Retrieval in RAG) framework introduces an LLM-leveraged weak supervision methodology for training dense retrievers in Retrieval-Augmented Generation (RAG) systems, specifically targeting open-domain question answering (OpenQA) without reliance on human-annotated supporting passages. W-RAG draws weak training signals by quantifying the ability of LLMs to generate correct answers using retrieved passages, and fine-tunes neural retrievers to prioritize passages that directly enhance downstream answer accuracy.
1. Motivation and Problem Setting
Open-domain question answering tasks demand high factual accuracy and evidence-based responses. While LLMs demonstrate strong language understanding, their reliance solely on parametric memory results in pronounced hallucinations. RAG systems mitigate this by incorporating a retriever that supplies relevant external passages; however, conventional dense retriever training typically necessitates large-scale, human-annotated datasets identifying which passages support an answer, imposing significant annotation overhead. Weakly supervised and unsupervised dense retriever approaches exist, yet these often optimize for superficial semantic similarity rather than directly for answerability—the capacity of a passage to enable the LLM to generate the reference answer.
W-RAG addresses this by leveraging the log-likelihood of the LLM generating the ground-truth answer given a passage, operationalizing the notion of task-oriented relevance in weak labeling for retriever training.
2. Methodology and W-RAG Pipeline
The W-RAG pipeline comprises three main stages, orchestrated to systematically convert downstream LLM answerability signals into dense retriever training objectives:
- Stage 1: Initial Retrieval
- For each QA pair $(q, a)$, retrieve a candidate pool of the top-$k$ passages from the corpus using a high-recall sparse retriever (e.g., BM25).
- $k$ is chosen to ensure broad coverage of potentially answer-bearing passages.
- Stage 2: LLM-based Reranking and Weak Label Generation
- For each candidate passage $p_i$, prompt the LLM with the tuple $(\mathrm{inst}, q, p_i)$, where $\mathrm{inst}$ is an instructional prompt describing the QA context.
- Calculate the log-likelihood score representing the LLM's probability of generating the ground-truth answer $a$ conditioned on $(\mathrm{inst}, q, p_i)$:

  $$s(q, p_i) = \frac{1}{|a|} \sum_{t=1}^{|a|} \log P_{\mathrm{LLM}}\big(a_t \mid \mathrm{inst}, q, p_i, a_{<t}\big)$$

  where $a_t$ denotes the $t$-th answer token.
- Rank passages by this score. The top-ranked passage serves as a "positive" for dense retriever fine-tuning; lower-ranked passages or in-batch negatives serve as "negatives" (see the scoring sketch after this pipeline).
- Stage 3: Weakly-Supervised Dense Retriever Fine-Tuning
- Utilize standard dense retriever architectures (DPR or ColBERT).
- For DPR: employ the Multiple Negatives Ranking (MNR) loss over in-batch negatives,

  $$\mathcal{L}_{\mathrm{MNR}} = -\log \frac{\exp\big(\gamma \cdot \mathrm{sim}(q, p^{+})\big)}{\sum_{j=1}^{B} \exp\big(\gamma \cdot \mathrm{sim}(q, p_j)\big)},$$

  where $\mathrm{sim}(\cdot,\cdot)$ is the similarity between query and passage embeddings, $B$ is the batch size, and the scale $\gamma$ (typically 20) sharpens the softmax.
- For ColBERT: the loss is triplet-based, contrasting the weak positive $p^{+}$ against a sampled negative $p^{-}$,

  $$\mathcal{L}_{\mathrm{ColBERT}} = -\log \frac{\exp\big(S(q, p^{+})\big)}{\exp\big(S(q, p^{+})\big) + \exp\big(S(q, p^{-})\big)},$$

  where $S(q, p)$ is ColBERT's late-interaction (MaxSim) relevance score.
- Positives are top log-likelihood passages; negatives are lower-ranked or hard negatives. A minimal PyTorch sketch of the MNR loss follows below.
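The Stage 2 scoring signal can be reproduced with any causal LLM that exposes token log-probabilities. Below is a minimal sketch, assuming a Hugging Face causal model and an illustrative prompt template (the model name, prompt wording, and function name are assumptions, not the authors' exact choices): it concatenates instruction, passage, question, and gold answer, then averages the log-probabilities of the answer tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the paper reports results with LLMs such as Llama3-8B.
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def answer_log_likelihood(question: str, passage: str, answer: str) -> float:
    """Mean log-probability of the answer tokens, given instruction, passage, and question."""
    # Hypothetical prompt template; the exact instruction wording is an assumption.
    prompt = (
        "Answer the question based on the given passage.\n"
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "Answer: "
    )
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

    logits = model(input_ids).logits                       # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    # Positions whose *next* token is an answer token.
    answer_positions = range(prompt_ids.size(1) - 1, input_ids.size(1) - 1)
    token_scores = [log_probs[0, pos, input_ids[0, pos + 1]] for pos in answer_positions]
    return torch.stack(token_scores).mean().item()

# Rank the BM25 candidate pool by answerability; the top passage is the weak positive.
# ranked = sorted(passages, key=lambda p: answer_log_likelihood(q, p, a), reverse=True)
```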
This approach requires only QA pairs and a document corpus, producing weak but actionable supervision for retriever updating.
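For concreteness, here is a minimal PyTorch sketch of the MNR-style in-batch softmax loss used for the DPR variant. The encoder is abstracted away; the random tensors, function name, and cosine-similarity choice are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def mnr_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Multiple Negatives Ranking loss with in-batch negatives.

    query_emb:   (B, d) question embeddings from the query encoder
    passage_emb: (B, d) embeddings of the weak-positive passages; for question i,
                 every passage j != i in the batch acts as an in-batch negative.
    """
    # Cosine-similarity matrix between every question and every passage in the batch.
    sim = F.cosine_similarity(query_emb.unsqueeze(1), passage_emb.unsqueeze(0), dim=-1)
    # The scale (typically 20) sharpens the softmax over in-batch candidates.
    logits = sim * scale
    # The weak-positive passage for question i sits on the diagonal.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Toy usage: random 768-d embeddings stand in for DPR encoder outputs.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(mnr_loss(q, p).item())
```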
3. Empirical Evaluation and Results
W-RAG’s empirical evaluation spans four standard OpenQA datasets: MSMARCO QnA, NQ, SQuAD, and WebQ, each using approximately 2,000 training QA pairs and a 500K-passage retrieval corpus.
| Retriever | MSMARCO QnA Recall@1 | MSMARCO QnA Answer F1 |
|---|---|---|
| ColBERT (init) | 0.0120 | - |
| BM25 | 0.1647 | 0.3060 |
| Contriever | 0.1585 | 0.3125 |
| ColBERT (W-RAG) | 0.1973 | 0.3150 |
| DPR (W-RAG) | 0.2023 | 0.3397 |
| ColBERT (supervised) | 0.2097 | 0.3227 |
Key findings:
- Weak labeling via LLM log-likelihood reranking substantially increases the recall of answer-bearing passages: on MSMARCO QnA, Recall@1 rises from 0.1694 with BM25 to 0.5239 after Llama3-8B reranking.
- W-RAG-trained retrievers outperform unsupervised baselines (BM25, Contriever, untrained ColBERT), narrowing the gap with fully supervised retrievers (e.g., supervised ColBERT's R@1 0.2097 vs. W-RAG ColBERT's 0.1973).
- Answer generation metrics (F1, Rouge-L, BLEU-1) show similar trends; for instance, ColBERT (W-RAG) achieves F1 0.3150 vs. ColBERT (supervised) 0.3227 and BM25 0.3060.
- Ablations confirm that prompt format and LLM choice moderately influence weak label quality, but all configurations yield improvements over purely unsupervised retrieval.
4. Technical Details and Implementation Considerations
- LLM Prompting: Each candidate passage is combined with the question, the gold answer, and explicit instructions for the LLM, so that the resulting likelihoods reflect genuine answer-generating potential.
- Log-Likelihood Averaging: To address vanishing probabilities for long answers, training uses mean log-probability per answer token.
- Retriever Training: Compatible with major dense retrieval frameworks. DPR leverages the MNR loss with in-batch negatives; ColBERT adopts a triplet loss with hard negative mining from LLM-ranked candidates (a weak-example assembly sketch follows this list).
- Resource Requirements: LLM-based reranking is inference-intensive; enlarging the candidate pool or scoring additional passages can further improve label quality at the cost of added latency.
- Scalability: Since only QA pairs and a corpus are needed, W-RAG sidesteps the bottleneck of annotated positive passage collection, enabling scalable retriever deployment in resource-constrained settings.
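To illustrate how these pieces fit together, here is a small, hypothetical sketch of assembling weak training examples from LLM-ranked candidates. The dataclass, function names, and the negative-sampling heuristic are assumptions; the rule of taking the top-scored passage as the positive follows the description above.

```python
import random
from dataclasses import dataclass, field

@dataclass
class WeakExample:
    question: str
    positive: str
    negatives: list[str] = field(default_factory=list)

def build_weak_example(question, answer, candidates, score_fn, n_negatives=4):
    """Rank BM25 candidates by LLM answer log-likelihood and form one training example.

    score_fn(question, passage, answer) -> mean answer-token log-probability,
    e.g. the answer_log_likelihood sketch shown earlier (an illustrative helper).
    """
    ranked = sorted(candidates, key=lambda p: score_fn(question, p, answer), reverse=True)
    positive = ranked[0]  # the top-scored passage becomes the weak positive
    # Lower-ranked candidates serve as negatives; this sampling heuristic is an assumption.
    tail = ranked[max(1, len(ranked) // 2):]
    negatives = random.sample(tail, k=min(n_negatives, len(tail)))
    return WeakExample(question=question, positive=positive, negatives=negatives)
```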
5. Significance and Comparative Impact
W-RAG constitutes a practical advance in RAG system construction for OpenQA, bridging the effectiveness gap between unsupervised and fully supervised retriever paradigms without dependency on costly annotated supporting evidence. By directly tying retriever rewards to answerability (rather than mere semantic similarity), W-RAG aligns the retrieval objective with downstream QA performance, yielding consistent improvements in both retrieval and answer generation metrics across diverse QA benchmarks.
The magnitude of improvement—recall@1 and F1 boosts on MSMARCO QnA, marked narrowing of the performance gap with supervised systems—demonstrates that LLM-based weak supervision is an effective proxy for manual passage annotation. The framework's generic applicability and implementation minimalism facilitate its use as a foundational method for scalable, high-accuracy RAG deployments in open-domain settings.
6. Extensions, Limitations, and Future Directions
- While LLM log-likelihood scores are impactful as weak relevance labels, their effectiveness varies with prompt design and LLM capacity.
- Including additional retrieved passages in the LLM context may further increase label and answer quality, subject to operational latency constraints.
- The framework is extensible to alternative dense retriever architectures, provided they can be trained on positive/negative passage pairs or triplets.
- A plausible implication is that, as LLMs improve in modeling factual dependencies, weak labeling approaches akin to W-RAG will likely become increasingly competitive relative to fully supervised paradigms.
- The method currently presumes high-recall in initial candidate retrieval (BM25 or similar); failures at this stage may limit weak supervision efficacy.
- Comprehensive code and additional detail are provided at the referenced repository: https://github.com/jmnian/weak_label_for_rag
W-RAG establishes a new standard for weakly supervised dense retriever training in RAG pipelines, balancing practicality and performance in real-world, knowledge-intensive language applications.