HF-RAG: Hierarchical Fusion for RAG
- The paper demonstrates that hierarchical fusion of labeled and unlabeled retrieval outputs significantly improves fact verification performance compared to single-source approaches.
- It employs intra-source reciprocal rank fusion and inter-source z-score standardization to effectively resolve score heterogeneity across different IR models.
- Experimental results show consistent macro-F1 and nDCG gains, underscoring HF-RAG’s adaptability and superior generalization on both in-domain and out-of-domain datasets.
Hierarchical Fusion-based RAG (HF-RAG) is an inference-only augmentation of Retrieval-Augmented Generation systems that integrates evidence from both labeled and unlabeled sources using a hierarchical score fusion pipeline. By systematically combining the outputs of multiple retrieval models and resolving the inherent score heterogeneity and complementary distributional coverage of labeled and unlabeled corpora, HF-RAG achieves robust improvements in fact verification tasks, especially for out-of-domain generalization (Santra et al., 2 Sep 2025).
1. Motivation and Problem Formulation
Standard RAG systems typically perform retrieval from a single source, such as Wikipedia, using a single retrieval model. This design limits adaptability: labeled sources encode explicit input-label associations (task-specific grounding), while unlabeled sources capture broader exogenous knowledge (contextual coverage). When multiple retrievers and sources are available, three central challenges arise:
- Score Heterogeneity: Retrieval scores from distinct IR models or corpora are not directly comparable, impeding effective aggregation.
- Ranker Complementarity: No single IR model (BM25, Contriever, ColBERT, MonoT5) is uniformly optimal across queries or domains.
- Source Complementarity: Over-reliance on labeled or unlabeled sources leads to sub-optimal in-domain or out-of-domain performance.
Combining labeled (e.g., FEVER claim-label pairs) and unlabeled (e.g., Wikipedia) data can leverage the strengths of each: Labeled RAG (L-RAG) provides task-specific anchoring, while Unlabeled RAG (U-RAG) mitigates overfitting and enhances generalization. However, naïve merging is undermined by mismatched score distributions and disparate ranking criteria (Santra et al., 2 Sep 2025).
2. HF-RAG Fusion Architecture
The hierarchical fusion pipeline operates in three stages:
- Intra-Source Ranker Fusion: For each source $C \in \{L, U\}$ (labeled or unlabeled), retrieve a top-$k$ document list from each IR model $\theta \in \Theta$. Fuse these lists using Reciprocal Rank Fusion (RRF), computing scores
$$s_C(d) = \sum_{\theta \in \Theta} \frac{1}{\kappa + \mathrm{rank}_\theta(d)},$$
where $\mathrm{rank}_\theta(d)$ is the rank of $d$ in retriever $\theta$'s list and $\kappa$ is the RRF smoothing constant. The top-$k$ documents by fused score form $D_C$.
- Inter-Source Score Standardization and Merge: For each source, compute the mean $\mu_C$ and standard deviation $\sigma_C$ of the RRF scores in $D_C$ and standardize document scores:
$$\phi_C(d) = \frac{s_C(d) - \mu_C}{\sigma_C}$$
Merge $D_L$ and $D_U$ by sorting all documents by their standardized scores to form the final retrieval set $D$.
- Context Construction for Generation: Concatenate the top-$k$ ranked passages in $D$ to the input prompt for the LLM.
This method requires no additional training, learned fusion weights, or supervised fine-tuning; it operates solely at inference time.
3. Experimental Setup and Evaluation
HF-RAG was evaluated primarily on the fact verification task, using both in-domain (FEVER) and out-of-domain (Climate-FEVER, SciFact) datasets. The retriever ensemble comprised BM25, Contriever, ColBERT, and MonoT5:
| Retriever | Retrieval Paradigm | Target Corpus |
|---|---|---|
| BM25 | Sparse, lexical | Labeled + Unlabeled |
| Contriever | Dense, bi-encoder | Labeled + Unlabeled |
| ColBERT | Dense, late interaction | Labeled + Unlabeled |
| MonoT5 | Cross-encoder re-ranking | Labeled + Unlabeled |
- Sources: FEVER train claim-label pairs; 2018 Wikipedia dump.
- Generators: LLaMA-2-70B Chat; Mistral-7B Instruct.
- Baselines: Parametric SFT, 0-shot, L-RAG, U-RAG, L-RAG-RRF, U-RAG-RRF, LU-RAG-$\lambda$ (linear mixing), RAG-OptSel (oracle).
Performance was measured using macro-F1 (3-way classification) and nDCG@10 for retrieval.
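For concreteness, here is a minimal sketch of how these two metrics can be computed with scikit-learn; the integer label encoding and the toy relevance/score values are illustrative assumptions, not the paper's evaluation harness.

```python
# Illustrative metric computation; labels and scores below are toy values.
from sklearn.metrics import f1_score, ndcg_score

# 3-way fact-verification labels: 0=SUPPORTS, 1=REFUTES, 2=NOT ENOUGH INFO (assumed encoding)
y_true = [0, 1, 2, 0, 1, 2, 0]
y_pred = [0, 1, 2, 0, 2, 2, 1]
macro_f1 = f1_score(y_true, y_pred, average="macro")

# nDCG@10 over one query's ranked list: graded relevance vs. retrieval scores
true_relevance   = [[3, 2, 0, 1, 0, 0, 2, 0, 0, 1]]
retrieval_scores = [[0.9, 0.8, 0.75, 0.7, 0.6, 0.5, 0.45, 0.4, 0.3, 0.2]]
ndcg_at_10 = ndcg_score(true_relevance, retrieval_scores, k=10)

print(f"macro-F1: {macro_f1:.4f}, nDCG@10: {ndcg_at_10:.4f}")
```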
HF-RAG surpassed all baselines in macro-F1:
- FEVER: 0.5744 (LLaMA), 0.5628 (Mistral)
- Climate-FEVER: 0.4838/0.5019 (vs. best single ≈0.5083)
- SciFact: 0.4320/0.4341 (vs. best single ≈0.4246)

These results demonstrate that hierarchical fusion surpasses not only the best single ranker/source RAG but also the oracle single-selection bound (Santra et al., 2 Sep 2025).
4. Detailed Fusion Algorithms
Intra-Source Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion combines outputs from multiple rankers for a given source by aggregating reciprocal ranks, yielding robust list-level fusion:
$$s_C(d) = \sum_{\theta \in \Theta} \frac{1}{\kappa + \mathrm{rank}_\theta(d)}$$
Documents not present in a given ranker's list are assigned a rank larger than any retrieved document, so they contribute negligibly to the fused score. The fused list consists of the top-$k$ documents by RRF score.
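A minimal sketch of this intra-source step, assuming each retriever returns an ordered list of document IDs and using the conventional RRF constant κ = 60 (the paper's exact constant and tie-breaking rules are not reproduced here):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=10, kappa=60):
    """Fuse per-retriever ranked lists (doc IDs, best first) with Reciprocal
    Rank Fusion and return the top-k (doc_id, fused_score) pairs."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents absent from this list simply receive no contribution.
            scores[doc_id] += 1.0 / (kappa + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:k]

# Example: three retrievers disagree on ordering; RRF rewards consensus.
bm25       = ["d1", "d2", "d3", "d4"]
contriever = ["d2", "d1", "d5", "d3"]
colbert    = ["d2", "d3", "d1", "d6"]
print(rrf_fuse([bm25, contriever, colbert], k=3))
```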
Inter-Source z-Score Standardization
To address inter-source score-scale discrepancies, HF-RAG computes the per-source mean and standard deviation of the fused scores,
$$\mu_C = \frac{1}{|D_C|} \sum_{d \in D_C} s_C(d), \qquad \sigma_C = \sqrt{\frac{1}{|D_C|} \sum_{d \in D_C} \bigl(s_C(d) - \mu_C\bigr)^2}.$$
Standardized scores are then
$$\phi_C(d) = \frac{s_C(d) - \mu_C}{\sigma_C}.$$
This adjustment permits fair comparison and merging of the top-$k$ documents from the labeled and unlabeled sources.
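A corresponding sketch of the inter-source step, assuming each source contributes (doc_id, RRF-score) pairs; the use of the population standard deviation and the guard against zero spread are implementation choices, not details specified in the paper:

```python
import statistics

def standardize(scored_docs):
    """z-normalize (doc_id, rrf_score) pairs within one source."""
    values = [score for _, score in scored_docs]
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0  # guard against zero spread
    return [(doc_id, (score - mu) / sigma) for doc_id, score in scored_docs]

def merge_sources(labeled_docs, unlabeled_docs, k=5):
    """Standardize each source's RRF scores, pool them, and keep the global top-k."""
    pooled = standardize(labeled_docs) + standardize(unlabeled_docs)
    pooled.sort(key=lambda item: item[1], reverse=True)
    return pooled[:k]

# Raw RRF scores sit on different scales per source; z-scores make them comparable.
labeled   = [("L:claim-17", 0.049), ("L:claim-03", 0.047), ("L:claim-88", 0.031)]
unlabeled = [("U:wiki-552", 0.016), ("U:wiki-104", 0.015), ("U:wiki-093", 0.014)]
print(merge_sources(labeled, unlabeled, k=4))
```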
Pipeline Summary (Pseudocode)
The following summarises the retrieval and fusion process:
```
for C in {labeled, unlabeled}:
    for theta in retrievers:
        L_k[C, theta] = retrieve_top_k_docs(x, theta, C)
    s_C = compute_RRF_scores(L_k[C, :])
    D_C = select_top_k_docs(s_C)
    mu_C, sigma_C = mean_std(s_C[D_C])
    phi_C = (s_C[D_C] - mu_C) / sigma_C
D_all = D_labeled ∪ D_unlabeled
L_k = select_top_k_docs_by_phi(D_all)
context = concatenate_text(L_k, x)
ŷ = LLM.generate(context)
```
5. Comparative Analysis and Insights
Ablation studies reveal that both intra-source RRF and inter-source z-score standardization are critical to performance:
- Within-Source Fusion: L-RAG-RRF and U-RAG-RRF consistently outperform single ranker L-RAG and U-RAG, validating the complementary nature of distinct IR models.
- Between-Source Fusion: LU-RAG-$\lambda$ (linear mixing) is less effective than z-score standardization, underscoring the necessity of adjusting for collection bias via the per-source statistics ($\mu_C$, $\sigma_C$).
- Generalization: On SciFact, HF-RAG provides relative improvements of 3–5 F1 points over baselines, indicating resilience to domain shift. OOD sensitivity analyses show adaptive source weighting: greater dependence on unlabeled evidence under high distributional shift and increased use of labeled context when domains are more aligned.
- Robustness to $k$: HF-RAG maintains high performance across a range of context sizes $k$, with performance plateauing as $k$ increases.
6. Integration with Generative Models
HF-RAG is architected as an inference-only extension of standard RAG. There are no additional trainable fusion layers; the top-$k$ fused documents are concatenated (or encoded as in Fusion-in-Decoder paradigms) and fed to the generative LLM, which synthesizes the final response. Thus, the pipeline is modular and requires no retraining or parameter adaptation for downstream tasks. The only tested generation tasks in the original work are fact verification using LLaMA-2-70B Chat and Mistral-7B Instruct (Santra et al., 2 Sep 2025).
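As a rough illustration of this inference-only integration, the sketch below concatenates the fused passages into a fact-verification prompt; the template wording and the commented-out `llm.generate` call are assumptions, not the paper's exact prompt or serving interface.

```python
def build_context_prompt(claim, fused_passages, k=5):
    """Prepend the top-k fused evidence passages to a 3-way verification prompt.
    The template text is illustrative, not the prompt used in the paper."""
    evidence = "\n".join(
        f"[{i + 1}] {passage}" for i, passage in enumerate(fused_passages[:k])
    )
    return (
        "Evidence:\n"
        f"{evidence}\n\n"
        f"Claim: {claim}\n"
        "Based only on the evidence above, answer with one of: "
        "SUPPORTS, REFUTES, NOT ENOUGH INFO.\n"
        "Answer:"
    )

prompt = build_context_prompt(
    claim="The Eiffel Tower is located in Berlin.",
    fused_passages=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
        "Berlin is the capital and largest city of Germany.",
    ],
)
print(prompt)
# prediction = llm.generate(prompt)  # any instruction-tuned LLM, e.g. LLaMA-2-70B Chat
```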
7. Limitations and Future Directions
HF-RAG’s hierarchical fusion remains fully unsupervised, using no learned fusion weights or context-dependent weighting; learned or adaptive methods could further improve retrieval and generation. The evaluation is restricted to fact verification; potential extensions include open-domain QA or multi-agent reasoning tasks. Current results suggest applicability beyond the tested datasets, but empirical investigation is required for other settings (Santra et al., 2 Sep 2025).