HF-RAG: Hierarchical Fusion for RAG
- The paper demonstrates that hierarchical fusion of labeled and unlabeled retrieval outputs significantly improves fact verification performance compared to single-source approaches.
- It employs intra-source reciprocal rank fusion and inter-source z-score standardization to effectively resolve score heterogeneity across different IR models.
- Experimental results show consistent macro-F1 and nDCG gains, underscoring HF-RAG’s adaptability and superior generalization on both in-domain and out-of-domain datasets.
Hierarchical Fusion-based RAG (HF-RAG) is an inference-only augmentation of Retrieval-Augmented Generation systems that integrates evidence from both labeled and unlabeled sources using a hierarchical score fusion pipeline. By systematically combining the outputs of multiple retrieval models and resolving the inherent score heterogeneity and complementary distributional coverage of labeled and unlabeled corpora, HF-RAG achieves robust improvements in fact verification tasks, especially for out-of-domain generalization (Santra et al., 2 Sep 2025).
1. Motivation and Problem Formulation
Standard RAG systems typically perform retrieval from a single source, such as Wikipedia, using a single retrieval model. This design limits adaptability: labeled sources encode explicit input-label associations (task-specific grounding), while unlabeled sources capture broader exogenous knowledge (contextual coverage). When multiple retrievers and sources are available, three central challenges arise:
- Score Heterogeneity: Retrieval scores from distinct IR models or corpora are not directly comparable, impeding effective aggregation.
- Ranker Complementarity: No single IR model (BM25, Contriever, ColBERT, MonoT5) is uniformly optimal across queries or domains.
- Source Complementarity: Over-reliance on labeled or unlabeled sources leads to sub-optimal in-domain or out-of-domain performance.
Combining labeled (e.g., FEVER claim-label pairs) and unlabeled (e.g., Wikipedia) data can leverage the strengths of each: Labeled RAG (L-RAG) provides task-specific anchoring, while Unlabeled RAG (U-RAG) mitigates overfitting and enhances generalization. However, naïve merging is undermined by mismatched score distributions and disparate ranking criteria (Santra et al., 2 Sep 2025).
2. HF-RAG Fusion Architecture
The hierarchical fusion pipeline operates in three stages:
- Intra-Source Ranker Fusion: For each source $C \in \{L, U\}$ (labeled or unlabeled), retrieve a top-$k$ document list from each IR model $\theta \in \Theta$. Fuse these lists using Reciprocal Rank Fusion (RRF), computing scores
$$s_C(d) = \sum_{\theta \in \Theta} \frac{1}{\kappa + \mathrm{rank}_\theta(d)},$$
where $\mathrm{rank}_\theta(d)$ is the rank of $d$ in retriever $\theta$'s list and $\kappa$ is the RRF smoothing constant. The top-$k$ documents by fused score form $D_C$.
- Inter-Source Score Standardization and Merge: For each source, compute the mean $\mu_C$ and standard deviation $\sigma_C$ of the RRF scores in $D_C$ and standardize document scores:
$$\phi_C(d) = \frac{s_C(d) - \mu_C}{\sigma_C}$$
Merge $D_L$ and $D_U$ by sorting all documents by their standardized scores to form the final retrieval set $D$.
- Context Construction for Generation: Concatenate the top-$k$ ranked passages in $D$ to the input prompt for the LLM.
This method requires no additional training, learned fusion weights, or supervised fine-tuning; it operates solely at inference time.
3. Experimental Setup and Evaluation
HF-RAG was evaluated primarily on the fact verification task, using both in-domain (FEVER) and out-of-domain (Climate-FEVER, SciFact) datasets. The retriever ensemble comprised BM25, Contriever, ColBERT, and MonoT5:
| Retriever | Retrieval Paradigm | Target Corpus |
|---|---|---|
| BM25 | Sparse, lexical | Labeled + Unlabeled |
| Contriever | Dense, bi-encoder | Labeled + Unlabeled |
| ColBERT | Dense, late interaction | Labeled + Unlabeled |
| MonoT5 | Cross-encoder re-ranking | Labeled + Unlabeled |
- Sources: FEVER train claim-label pairs; 2018 Wikipedia dump.
- Generators: LLaMA-2-70B Chat; Mistral-7B Instruct.
- Baselines: Parametric SFT, 0-shot, L-RAG, U-RAG, L-RAG-RRF, U-RAG-RRF, LU-RAG-$\lambda$ (linear mixing), RAG-OptSel (oracle).
Performance was measured using macro-F1 (3-way classification) and nDCG@10 for retrieval.
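For concreteness, here is a minimal sketch of how these two metrics can be computed with scikit-learn; the integer label encoding and the toy relevance/score values are illustrative assumptions, not the paper's evaluation harness.

```python
# Illustrative metric computation; labels and scores below are toy values.
from sklearn.metrics import f1_score, ndcg_score

# 3-way fact-verification labels: 0=SUPPORTS, 1=REFUTES, 2=NOT ENOUGH INFO (assumed encoding)
y_true = [0, 1, 2, 0, 1, 2, 0]
y_pred = [0, 1, 2, 0, 2, 2, 1]
macro_f1 = f1_score(y_true, y_pred, average="macro")

# nDCG@10 over one query's ranked list: graded relevance vs. retrieval scores
true_relevance   = [[3, 2, 0, 1, 0, 0, 2, 0, 0, 1]]
retrieval_scores = [[0.9, 0.8, 0.75, 0.7, 0.6, 0.5, 0.45, 0.4, 0.3, 0.2]]
ndcg_at_10 = ndcg_score(true_relevance, retrieval_scores, k=10)

print(f"macro-F1: {macro_f1:.4f}, nDCG@10: {ndcg_at_10:.4f}")
```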
HF-RAG surpassed all baselines in macro-F1:
- FEVER: 0.5744 (LLaMA), 0.5628 (Mistral)
- Climate-FEVER: 0.4838/0.5019 (vs. best single ≈0.5083)
- SciFact: 0.4320/0.4341 (vs. best single ≈0.4246)

These results demonstrate that hierarchical fusion surpasses not only the best single ranker/source RAG but also the oracle single-selection bound (Santra et al., 2 Sep 2025).
4. Detailed Fusion Algorithms
Intra-Source Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion combines outputs from multiple rankers for a given source by aggregating reciprocal ranks, yielding robust list-level fusion:
$$s_C(d) = \sum_{\theta \in \Theta} \frac{1}{\kappa + \mathrm{rank}_\theta(d)}$$
Documents not present in a given ranker's list are assigned a rank larger than any retrieved document, so they contribute negligibly to the fused score. The fused list consists of the top-$k$ documents by RRF score.
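A minimal sketch of this intra-source step, assuming each retriever returns an ordered list of document IDs and using the conventional RRF constant κ = 60 (the paper's exact constant and tie-breaking rules are not reproduced here):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=10, kappa=60):
    """Fuse per-retriever ranked lists (doc IDs, best first) with Reciprocal
    Rank Fusion and return the top-k (doc_id, fused_score) pairs."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents absent from this list simply receive no contribution.
            scores[doc_id] += 1.0 / (kappa + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:k]

# Example: three retrievers disagree on ordering; RRF rewards consensus.
bm25       = ["d1", "d2", "d3", "d4"]
contriever = ["d2", "d1", "d5", "d3"]
colbert    = ["d2", "d3", "d1", "d6"]
print(rrf_fuse([bm25, contriever, colbert], k=3))
```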
Inter-Source z-Score Standardization
To address inter-source score-scale discrepancies, HF-RAG computes the per-source mean and standard deviation of the fused scores,
$$\mu_C = \frac{1}{|D_C|} \sum_{d \in D_C} s_C(d), \qquad \sigma_C = \sqrt{\frac{1}{|D_C|} \sum_{d \in D_C} \bigl(s_C(d) - \mu_C\bigr)^2}.$$
Standardized scores are then
$$\phi_C(d) = \frac{s_C(d) - \mu_C}{\sigma_C}.$$
This adjustment permits fair comparison and merging of the top-$k$ documents from the labeled and unlabeled sources.
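A corresponding sketch of the inter-source step, assuming each source contributes (doc_id, RRF-score) pairs; the use of the population standard deviation and the guard against zero spread are implementation choices, not details specified in the paper:

```python
import statistics

def standardize(scored_docs):
    """z-normalize (doc_id, rrf_score) pairs within one source."""
    values = [score for _, score in scored_docs]
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0  # guard against zero spread
    return [(doc_id, (score - mu) / sigma) for doc_id, score in scored_docs]

def merge_sources(labeled_docs, unlabeled_docs, k=5):
    """Standardize each source's RRF scores, pool them, and keep the global top-k."""
    pooled = standardize(labeled_docs) + standardize(unlabeled_docs)
    pooled.sort(key=lambda item: item[1], reverse=True)
    return pooled[:k]

# Raw RRF scores sit on different scales per source; z-scores make them comparable.
labeled   = [("L:claim-17", 0.049), ("L:claim-03", 0.047), ("L:claim-88", 0.031)]
unlabeled = [("U:wiki-552", 0.016), ("U:wiki-104", 0.015), ("U:wiki-093", 0.014)]
print(merge_sources(labeled, unlabeled, k=4))
```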
Pipeline Summary (Pseudocode)
The following summarises the retrieval and fusion process:
```
for C in {labeled, unlabeled}:
    for theta in retrievers:
        L_k[C, theta] = retrieve_top_k_docs(x, theta, C)
    s_C = compute_RRF_scores(L_k[C, :])
    D_C = select_top_k_docs(s_C)
    mu_C, sigma_C = mean_std(s_C[D_C])
    phi_C = (s_C[D_C] - mu_C) / sigma_C
D_all = D_labeled ∪ D_unlabeled
L_k = select_top_k_docs_by_phi(D_all)
context = concatenate_text(L_k, x)
ŷ = LLM.generate(context)
```
5. Comparative Analysis and Insights
Ablation studies reveal that both intra-source RRF and inter-source z-score standardization are critical to performance:
- Within-Source Fusion: L-RAG-RRF and U-RAG-RRF consistently outperform single ranker L-RAG and U-RAG, validating the complementary nature of distinct IR models.
- Between-Source Fusion: LU-RAG-$\lambda$ (linear mixing) is less effective than z-score standardization, underscoring the necessity of adjusting for collection bias via the per-source statistics ($\mu_C$, $\sigma_C$).
- Generalization: On SciFact, HF-RAG provides relative improvements of 3–5 F1 points over baselines, indicating resilience to domain shift. OOD sensitivity analyses show adaptive source weighting: greater dependence on unlabeled evidence under high distributional shift and increased use of labeled context when domains are more aligned.
- Robustness to $k$: HF-RAG maintains high performance across a range of context sizes $k$, with performance plateauing as $k$ increases.
6. Integration with Generative Models
HF-RAG is architected as an inference-only extension of standard RAG. There are no additional trainable fusion layers; the top-$k$ fused documents are concatenated (or encoded as in Fusion-in-Decoder paradigms) and fed to the generative LLM, which synthesizes the final response. Thus, the pipeline is modular and requires no retraining or parameter adaptation for downstream tasks. The only tested generation tasks in the original work are fact verification using LLaMA-2-70B Chat and Mistral-7B Instruct (Santra et al., 2 Sep 2025).
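As a rough illustration of this inference-only integration, the sketch below concatenates the fused passages into a fact-verification prompt; the template wording and the commented-out `llm.generate` call are assumptions, not the paper's exact prompt or serving interface.

```python
def build_context_prompt(claim, fused_passages, k=5):
    """Prepend the top-k fused evidence passages to a 3-way verification prompt.
    The template text is illustrative, not the prompt used in the paper."""
    evidence = "\n".join(
        f"[{i + 1}] {passage}" for i, passage in enumerate(fused_passages[:k])
    )
    return (
        "Evidence:\n"
        f"{evidence}\n\n"
        f"Claim: {claim}\n"
        "Based only on the evidence above, answer with one of: "
        "SUPPORTS, REFUTES, NOT ENOUGH INFO.\n"
        "Answer:"
    )

prompt = build_context_prompt(
    claim="The Eiffel Tower is located in Berlin.",
    fused_passages=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
        "Berlin is the capital and largest city of Germany.",
    ],
)
print(prompt)
# prediction = llm.generate(prompt)  # any instruction-tuned LLM, e.g. LLaMA-2-70B Chat
```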
7. Limitations and Future Directions
HF-RAG’s hierarchical fusion remains fully unsupervised, using no learned fusion weights or context-dependent weighting; learned or adaptive methods could further improve retrieval and generation. The evaluation is restricted to fact verification; potential extensions include open-domain QA or multi-agent reasoning tasks. Current results suggest applicability beyond the tested datasets, but empirical investigation is required for other settings (Santra et al., 2 Sep 2025).