
HF-RAG: Hierarchical Fusion for RAG

Updated 18 November 2025
  • The paper demonstrates that hierarchical fusion of labeled and unlabeled retrieval outputs significantly improves fact verification performance compared to single-source approaches.
  • It employs intra-source reciprocal rank fusion and inter-source z-score standardization to effectively resolve score heterogeneity across different IR models.
  • Experimental results exhibit robust macro-F1 and nDCG gains, underlining HF-RAG’s adaptability and superior generalization on both in-domain and out-of-domain datasets.

Hierarchical Fusion-based RAG (HF-RAG) is an inference-only augmentation of Retrieval-Augmented Generation systems that integrates evidence from both labeled and unlabeled sources using a hierarchical score fusion pipeline. By systematically combining the outputs of multiple retrieval models and resolving the inherent score heterogeneity and complementary distributional coverage of labeled and unlabeled corpora, HF-RAG achieves robust improvements in fact verification tasks, especially for out-of-domain generalization (Santra et al., 2 Sep 2025).

1. Motivation and Problem Formulation

Standard RAG systems typically perform retrieval from a single source, such as Wikipedia, using a single retrieval model. This design limits adaptability: labeled sources encode explicit input-label associations (task-specific grounding), while unlabeled sources capture broader exogenous knowledge (contextual coverage). When multiple retrievers and sources are available, three central challenges arise:

  • Score Heterogeneity: Retrieval scores from distinct IR models or corpora are not directly comparable, impeding effective aggregation.
  • Ranker Complementarity: No single IR model (BM25, Contriever, ColBERT, MonoT5) is uniformly optimal across queries or domains.
  • Source Complementarity: Over-reliance on labeled or unlabeled sources leads to sub-optimal in-domain or out-of-domain performance.

Combining labeled (e.g., FEVER claim-label pairs) and unlabeled (e.g., Wikipedia) data can leverage the strengths of each: Labeled RAG (L-RAG) provides task-specific anchoring, while Unlabeled RAG (U-RAG) mitigates overfitting and enhances generalization. However, naïve merging is undermined by mismatched score distributions and disparate ranking criteria (Santra et al., 2 Sep 2025).

2. HF-RAG Fusion Architecture

The hierarchical fusion pipeline operates in three stages:

  1. Intra-Source Ranker Fusion: For each source C ∈ {l, u} (labeled or unlabeled), retrieve top-k document lists L_k^{C,θ} from each IR model θ ∈ Θ. Fuse these lists using Reciprocal Rank Fusion (RRF), computing scores:

\overline{\theta_C}(d) = \sum_{\theta \in \Theta} \frac{1}{\text{rank}(L_k^{C,\theta}, d)}

The top-k documents by fused score form L_k^C.

  2. Inter-Source Score Standardization and Merge: For each source, compute the mean and standard deviation (μ_C, σ_C) of the RRF scores in L_k^C and standardize document scores:

\phi_C(d) = \frac{s_C(d) - \mu_C}{\sigma_C}

Merge L_k^l and L_k^u by sorting all documents by the standardized scores φ_C(d) to form the final retrieval set L_k.

  3. Context Construction for Generation: Concatenate the top-m ranked passages in L_k to the input prompt for the LLM.

This method requires no additional training, learned fusion weights, or supervised fine-tuning; it operates solely at inference time.
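
A minimal Python sketch of the two fusion stages is given below, assuming each retriever already returns a ranked list of document IDs per source; the function names (rrf_fuse, z_standardize, hf_rag_retrieve), the big_m constant, and the zero-variance guard are illustrative choices rather than details fixed by the paper.

from typing import Dict, List, Tuple

def rrf_fuse(ranked_lists: List[List[str]], k: int, big_m: int = 1000) -> Dict[str, float]:
    """Intra-source fusion: sum reciprocal ranks over all retrievers.
    A document absent from a list is assigned rank big_m >> k."""
    candidates = {d for lst in ranked_lists for d in lst[:k]}
    scores = {}
    for d in candidates:
        scores[d] = sum(
            1.0 / (lst.index(d) + 1) if d in lst[:k] else 1.0 / big_m
            for lst in ranked_lists
        )
    return scores

def z_standardize(scores: Dict[str, float]) -> Dict[str, float]:
    """Standardize one source's fused scores to zero mean and unit variance."""
    vals = list(scores.values())
    mu = sum(vals) / len(vals)
    sigma = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0  # guard against zero variance
    return {d: (v - mu) / sigma for d, v in scores.items()}

def hf_rag_retrieve(labeled_lists: List[List[str]],
                    unlabeled_lists: List[List[str]],
                    k: int = 10, m: int = 5) -> List[Tuple[str, str]]:
    """Hierarchical fusion: per-source RRF, then z-score merge across sources."""
    merged = []
    for source, lists in (("labeled", labeled_lists), ("unlabeled", unlabeled_lists)):
        fused = rrf_fuse(lists, k)
        top_k = dict(sorted(fused.items(), key=lambda kv: -kv[1])[:k])  # L_k^C
        for doc, phi in z_standardize(top_k).items():
            merged.append((phi, source, doc))
    merged.sort(reverse=True)  # sort all documents by standardized score
    return [(source, doc) for _, source, doc in merged[:m]]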

3. Experimental Setup and Evaluation

HF-RAG was evaluated primarily on the fact verification task, using both in-domain (FEVER) and out-of-domain (Climate-FEVER, SciFact) datasets. The retriever ensemble comprised BM25, Contriever, ColBERT, and MonoT5:

Retriever  | Retrieval Paradigm       | Target Corpus
BM25       | Sparse, lexical          | Labeled + Unlabeled
Contriever | Dense, bi-encoder        | Labeled + Unlabeled
ColBERT    | Dense, late interaction  | Labeled + Unlabeled
MonoT5     | Cross-encoder re-ranking | Labeled + Unlabeled
  • Sources: l = FEVER train claim-label pairs; u = 2018 Wikipedia dump.
  • Generators: LLaMA-2-70B Chat; Mistral-7B Instruct.
  • Baselines: Parametric SFT, 0-shot, L-RAG, U-RAG, L-RAG-RRF, U-RAG-RRF, LU-RAG-α (linear mixing), RAG-OptSel (oracle).

Performance was measured using macro-F1 (3-way classification) and nDCG@10 for retrieval.
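
For reference, both metrics are available in scikit-learn; the label strings and score values below are purely illustrative.

from sklearn.metrics import f1_score, ndcg_score

# Macro-F1 over the three-way verification labels (illustrative predictions)
y_true = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO", "SUPPORTS"]
y_pred = ["SUPPORTS", "SUPPORTS", "NOT ENOUGH INFO", "SUPPORTS"]
macro_f1 = f1_score(y_true, y_pred, average="macro")

# nDCG@10 for one query: gold relevance of candidate passages vs. fused retrieval scores
gold_relevance = [[1, 0, 1, 0, 0]]
fused_scores = [[0.9, 0.7, 0.6, 0.3, 0.1]]
ndcg_at_10 = ndcg_score(gold_relevance, fused_scores, k=10)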

HF-RAG surpassed all baselines in macro-F1:

  • FEVER: 0.5744 (LLaMA), 0.5628 (Mistral)
  • Climate-FEVER: 0.4838/0.5019 (vs. best single ≈0.5083)
  • SciFact: 0.4320/0.4341 (vs. best single ≈0.4246)

These results indicate that hierarchical fusion surpasses not only the best single-ranker/single-source RAG but also the oracle single-selection bound (Santra et al., 2 Sep 2025).

4. Detailed Fusion Algorithms

Intra-Source Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion combines outputs from multiple rankers for a given source by aggregating the reciprocal ranks, yielding robust list-level fusion:

\overline{\theta_C}(d) = \sum_{\theta \in \Theta} \frac{1}{\text{rank}(L_k^{C,\theta}, d)}

Documents not present in a given list are assigned a rank M ≫ k. The fused list L_k^C consists of the top-k documents by RRF score.

Inter-Source z-Score Standardization

To address inter-source score scale discrepancies, HF-RAG computes:

\mu_C = \frac{1}{|D_C|}\sum_{d\in D_C} s_C(d), \quad \sigma_C = \sqrt{\frac{1}{|D_C|} \sum_{d\in D_C}\left(s_C(d)-\mu_C\right)^2}

Standardized scores are

\phi_C(d) = \frac{s_C(d) - \mu_C}{\sigma_C}

This adjustment permits fair comparison and merging of top-k documents from labeled and unlabeled sources.
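
A small worked example (with made-up scores) shows the effect: raw RRF scores would rank every labeled document above every unlabeled one simply because the two score ranges differ, whereas standardized scores interleave the sources by within-source standing.

from statistics import mean, pstdev

# Hypothetical per-source RRF scores for the retained top documents
labeled   = {"l1": 0.95, "l2": 0.60, "l3": 0.40}
unlabeled = {"u1": 0.08, "u2": 0.05, "u3": 0.02}

def standardize(scores):
    mu, sigma = mean(scores.values()), pstdev(scores.values())
    return {d: (s - mu) / sigma for d, s in scores.items()}

merged = {**standardize(labeled), **standardize(unlabeled)}
ranking = sorted(merged, key=merged.get, reverse=True)
print(ranking)  # ['l1', 'u1', 'u2', 'l2', 'l3', 'u3']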

Pipeline Summary (Pseudocode)

The following summarises the retrieval and fusion process:

# For each source, fuse the retrievers' ranked lists (RRF), then standardize scores
for C in {labeled, unlabeled}:
    for theta in retrievers:
        L_k[C, theta] = retrieve_top_k_docs(x, theta, C)
    s_C = compute_RRF_scores(L_k[C, :])
    D_C = select_top_k_docs(s_C)
    mu_C, sigma_C = mean_std(s_C[D_C])
    phi_C = (s_C[D_C] - mu_C) / sigma_C

# Merge the two sources by standardized score and generate
D_all = D_labeled ∪ D_unlabeled
L_k = select_top_k_docs_by_phi(D_all)
context = concatenate_text(L_k, x)
ŷ = LLM.generate(context)

5. Comparative Analysis and Insights

Ablation studies reveal that both intra-source RRF and inter-source z-score standardization are critical to performance:

  • Within-Source Fusion: L-RAG-RRF and U-RAG-RRF consistently outperform single ranker L-RAG and U-RAG, validating the complementary nature of distinct IR models.
  • Between-Source Fusion: LU-RAG-α (linear mixing) is less effective than z-score standardization, underscoring the necessity of adjusting for collection bias (μ_C, σ_C).
  • Generalization: On SciFact, HF-RAG provides relative improvements of 3–5 F1 points over baselines, indicating resilience to domain shift. OOD sensitivity analyses show adaptive source weighting: greater dependence on unlabeled evidence under high distributional shift and increased use of labeled context when domains are more aligned.
  • Robustness to k: HF-RAG maintains high performance across a range of context sizes k, with performance plateauing beyond k ≈ 10.

6. Integration with Generative Models

HF-RAG is architected as an inference-only extension of standard RAG. There are no additional trainable fusion layers; the top-k fused documents are concatenated (or encoded as in Fusion-in-Decoder paradigms) and fed to the generative LLM, which synthesizes the final response. The pipeline is therefore modular and requires no retraining or parameter adaptation for downstream tasks. The only generation task tested in the original work is fact verification, using LLaMA-2-70B Chat and Mistral-7B Instruct (Santra et al., 2 Sep 2025).
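
A minimal sketch of this final step is shown below, assuming a plain instruction-style prompt; the exact prompt wording is not specified in the source and is an illustrative choice.

def build_context(fused_passages, claim, m=5):
    """Concatenate the top-m fused passages ahead of the claim to be verified."""
    evidence = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(fused_passages[:m]))
    return (
        "Using the evidence below, classify the claim as "
        "SUPPORTS, REFUTES, or NOT ENOUGH INFO.\n\n"
        f"Evidence:\n{evidence}\n\nClaim: {claim}\nVerdict:"
    )

# The resulting prompt is passed unchanged to the generator
# (LLaMA-2-70B Chat or Mistral-7B Instruct in the reported experiments).
prompt = build_context(
    ["First retrieved passage ...", "Second retrieved passage ..."],
    "Example claim to verify.",
)
print(prompt)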

7. Limitations and Future Directions

HF-RAG’s hierarchical fusion remains fully unsupervised, using no learned fusion weights or context-dependent weighting; learned or adaptive methods could further improve retrieval and generation. The evaluation is restricted to fact verification; potential extensions include open-domain QA or multi-agent reasoning tasks. Current results suggest applicability beyond the tested datasets, but empirical investigation is required for other settings (Santra et al., 2 Sep 2025).

References

Santra et al. (2025). HF-RAG: Hierarchical Fusion for RAG. 2 September 2025.