
HF-RAG: Hierarchical Fusion-based RAG with Multiple Sources and Rankers

Published 2 Sep 2025 in cs.IR and cs.AI | (2509.02837v1)

Abstract: Leveraging both labeled (input-output associations) and unlabeled data (wider contextual grounding) may provide complementary benefits in retrieval augmented generation (RAG). However, effectively combining evidence from these heterogeneous sources is challenging as the respective similarity scores are not inter-comparable. Additionally, aggregating beliefs from the outputs of multiple rankers can improve the effectiveness of RAG. Our proposed method first aggregates the top-documents from a number of IR models using a standard rank fusion technique for each source (labeled and unlabeled). Next, we standardize the retrieval score distributions within each source by applying z-score transformation before merging the top-retrieved documents from the two sources. We evaluate our approach on the fact verification task, demonstrating that it consistently improves over the best-performing individual ranker or source and also shows better out-of-domain generalization.

Summary

  • The paper presents a novel hierarchical fusion strategy that aggregates multiple retrieval models using reciprocal rank fusion and z-score transformation.
  • It demonstrates superior in-domain and out-of-domain performance on datasets such as FEVER, Climate-FEVER, and SciFact through improved document relevance.
  • The method addresses non-comparability issues between labeled and unlabeled sources, setting a new benchmark for automated fact verification.

HF-RAG: Hierarchical Fusion-based RAG with Multiple Sources and Rankers

The paper "HF-RAG: Hierarchical Fusion-based RAG with Multiple Sources and Rankers" presents a novel approach for improving the performance of retrieval-augmented generation (RAG) by leveraging both labeled and unlabeled data sources. The method integrates multiple retrieval models to enhance fact verification processes by aggregating and standardizing evidence retrieved from diverse sources, thus addressing non-comparability issues in similarity scores.

Introduction

Social media and other online platforms are highly susceptible to the spread of misinformation, making automated fact-checking increasingly important. The accuracy of such systems depends heavily on retrieving relevant documents from large pools of information. RAG models that draw on both labeled and unlabeled data have shown promise, but they struggle to integrate these heterogeneous sources seamlessly because their scoring metrics are not directly comparable. This paper introduces a hierarchical fusion-based RAG (HF-RAG) methodology aimed at overcoming these barriers by combining multiple rankers with a refined aggregation technique.

Methodology: Hierarchical Fusion Process

The HF-RAG approach involves a two-tier hierarchical fusion mechanism:

  1. Intra-Source Ranker Fusion: Each source, labeled or unlabeled, leverages multiple information retrieval (IR) models to assemble top-ranked documents. A standard rank fusion technique, Reciprocal Rank Fusion (RRF), aggregates the outputs from these models to enhance the relevance of retrieved documents (Figure 1).

    Figure 1: Our proposed approach HF-RAG leverages both labeled and unlabeled data to provide sub-topic-specific contextual information.

  2. Inter-Source Z-score Transformation: To merge lists from different sources, HF-RAG applies a z-score transformation, standardizing each document score onto a common scale. This transformation enables a fair comparison across sources, yielding a more coherently ranked final list of documents (Figure 2; a minimal code sketch of both steps appears after the figure caption).

    Figure 2: Schematic overview of our proposed method HF-RAG. For a given claim, multiple retrievers are employed to obtain top-ranked documents from labeled and unlabeled sources. These top-documents for each source are combined via reciprocal rank fusion (RRF). These fused lists of non-overlapping documents from the two sources are then merged with a z-score transformation.
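
To make the two-tier process concrete, the following is a minimal sketch of how intra-source RRF and inter-source z-score merging could be implemented. The function names, the RRF constant k = 60, and the choice to z-score the fused RRF scores within each source are illustrative assumptions, not the authors' exact implementation.

```python
import statistics
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists from several rankers
    (within ONE source) into a single {doc_id: score} dict.
    `rankings` is a list of ranked document-id lists, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # standard RRF contribution
    return dict(scores)

def zscore(scores):
    """Standardize a {doc_id: score} dict so scores from different
    sources live on a comparable scale."""
    vals = list(scores.values())
    mu = statistics.mean(vals)
    sigma = statistics.pstdev(vals) or 1.0       # guard against zero variance
    return {d: (s - mu) / sigma for d, s in scores.items()}

def hf_rag_fuse(labeled_rankings, unlabeled_rankings, top_k=10):
    """Two-tier fusion: RRF within each source, z-score merge across sources."""
    labeled = zscore(rrf_fuse(labeled_rankings))
    unlabeled = zscore(rrf_fuse(unlabeled_rankings))
    merged = {**labeled, **unlabeled}            # doc ids are source-disjoint
    return sorted(merged, key=merged.get, reverse=True)[:top_k]

# Toy usage: two rankers per source, document ids prefixed by source.
labeled_runs = [["L1", "L2", "L3"], ["L2", "L1", "L4"]]
unlabeled_runs = [["U7", "U2"], ["U2", "U9", "U7"]]
print(hf_rag_fuse(labeled_runs, unlabeled_runs, top_k=5))
```

In a real pipeline, the per-source rankings would come from the individual IR models (e.g., BM25, Contriever, ColBERT) run against the labeled and unlabeled collections, and the final merged list would be passed to the generator as context.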

Experimental Setup

The evaluation of HF-RAG was conducted using the FEVER, Climate-FEVER, and SciFact datasets. FEVER provides a foundational in-domain evaluation, while Climate-FEVER and SciFact serve as out-of-domain tests to assess the generalization capabilities of the proposed model. Retrieval models, including BM25, Contriever, and ColBERT, were utilized to test the method's effectiveness in different configurations.

Results and Implications

HF-RAG demonstrated superior performance in both in-domain and out-of-domain evaluations.

  • Improvement in OOD Performance: It consistently outperformed baseline RAG configurations, highlighting its robust applicability to scenarios involving domain shifts.
  • Contribution of Z-score Standardization: The application of z-score transformation proved valuable in combining resources from disparate domains, outperforming proportional selection techniques by providing a normalized space for document evaluation (Figure 3).

Figure 3: FEVER dataset.

Figure 4: Sensitivity on SciFact.

The results underscored the critical role of IR ranker fusion in enhancing retrieval effectiveness, which is directly tied to downstream task performance such as fact verification. HF-RAG's ability to adaptively select documents from varied sources supports more reliable factual consistency judgments.

Conclusion

The hierarchical fusion-based RAG approach presented in this paper leverages the complementary strengths of labeled and unlabeled data sources through intra-source ranker fusion and inter-source score standardization. By addressing the comparability of disparate score distributions with z-score transformations, HF-RAG achieves strong results on both in-domain and cross-domain fact verification tasks. Future work may explore extending HF-RAG to multi-agent RAG systems, potentially further improving generalization and efficiency in automated fact-checking applications.


Explain it Like I'm 14

HF-RAG: An easy explanation

What is this paper about?

This paper introduces a new way to help AI systems check facts, called HF-RAG. It combines two kinds of information to answer a claim:

  • Labeled examples (like practice problems with answers)
  • Unlabeled text (like helpful articles from Wikipedia)

It also combines results from several different “search helpers” to find the best evidence. The goal is to make fact-checking more accurate, especially for topics the AI hasn’t seen before.


Main goal of the paper

The researchers want to make AI better at verifying statements like “Polar bears are going extinct because of global warming.” To do that, they:

  • Mix information from both labeled data (claims with known answers) and unlabeled data (general knowledge articles)
  • Combine the top results from multiple search systems
  • Standardize scores so they can fairly compare results from different sources

They test this idea on fact-checking tasks and show it works better than common methods.


Key questions the paper asks

The paper focuses on four simple questions:

  1. Does combining labeled and unlabeled information help the AI work better on new topics?
  2. What helps more: using multiple search systems or using multiple information sources?
  3. If the AI finds better evidence, does it actually give better answers?
  4. How sensitive is the method to how many examples are included as context?

How the method works, in simple terms

Think of the system like a student checking a claim using two backpacks:

  • Backpack 1: Labeled data (examples of claims with correct answers). This helps with task-specific patterns.
  • Backpack 2: Unlabeled data (articles from Wikipedia). This helps with broader knowledge.

HF-RAG combines both backpacks wisely and uses multiple “search buddies” to fetch the best items from each.

Important terms, explained

  • Retrieval-Augmented Generation (RAG): The AI first retrieves helpful text from outside sources, then uses it to generate an answer.
  • Labeled vs. Unlabeled:
    • Labeled: Claims with known answers (support/refute/not enough info).
    • Unlabeled: Articles without answers, like Wikipedia pages.
  • Rankers: Different search tools that pick the best matching documents for a claim (e.g., BM25, Contriever, ColBERT, MonoT5).
  • Reciprocal Rank Fusion (RRF): A way to merge the top results from multiple rankers. Imagine each ranker makes a top-10 list; RRF gives extra points to documents that appear high across several lists.
  • Z-score standardization: A way to make scores comparable across different sources. Think of it like comparing test scores from different schools by converting them to a common scale (a tiny numeric example of both ideas follows this list).
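
To make these two terms concrete, here is a tiny made-up example; the numbers and the RRF constant k = 60 are purely illustrative and do not come from the paper.

```python
# Reciprocal Rank Fusion: a document ranked 2nd by one ranker and 5th by another
k = 60
rrf_points = 1 / (k + 2) + 1 / (k + 5)       # ~0.0315; documents high on several lists score more

# Z-score: put scores from two different "schools" (sources) on one common scale
def z(score, mean, std):
    return (score - mean) / std

labeled_doc = z(12.0, mean=10.0, std=2.0)    # 1.0 standard deviations above its source's average
unlabeled_doc = z(0.80, mean=0.50, std=0.15) # 2.0 above its source's average, so it ranks higher
print(rrf_points, labeled_doc, unlabeled_doc)
```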

The HF-RAG approach in two steps

  • Step 1: Combine multiple rankers within each source (labeled or unlabeled)
    • For labeled and unlabeled sources separately, use RRF to merge the top documents from several rankers into one strong list per source.
  • Step 2: Combine the two sources
    • You can’t directly compare scores from different sources (they’re on different scales).
    • So, use z-scores to standardize each source’s scores, then merge the two lists into one final list.
    • The AI then uses this merged context to decide if the claim is supported, refuted, or unclear.

They test this using LLMs like LLaMA 2 (70B) and Mistral (7B), and datasets like FEVER (in-domain), Climate-FEVER, and SciFact (out-of-domain, i.e., different topics).


What did they find?

Here are the key results, explained simply:

  • Using both labeled and unlabeled sources together works best.
    • HF-RAG outperforms methods that only use labeled examples (L-RAG) or only use Wikipedia (U-RAG).
    • It does especially well on topics outside the training domain, like scientific claims in SciFact.
  • Using multiple search tools improves performance even before combining sources.
    • Combining rankers with RRF helps get better evidence lists.
    • Then combining labeled and unlabeled lists with z-scores improves results further.
  • Z-score mixing beats simple “fixed mixing.”
    • A simple method that picks a fixed percent from each source performs worse than z-score standardization, which adapts to the scores and compares fairly.
  • Better evidence leads to better answers.
    • When retrieval quality improves (measured with a ranking metric called nDCG@10), final fact-checking scores (macro F1) improve too (a small code sketch of both metrics follows this list).
  • The method is stable and not too sensitive to “how many examples you include.”
    • HF-RAG stays strong across different context sizes, and works well with about 10 examples in the prompt.
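
For readers curious about those two metrics, here is a generic sketch using their standard definitions; it is not the paper's evaluation code, and the example labels are made up.

```python
import math

def ndcg_at_10(relevances):
    """nDCG@10 for one query: `relevances` are graded relevance labels
    of the retrieved documents, in the order the system ranked them."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def macro_f1(gold, pred):
    """Macro F1 over the fact-verification labels (support / refute / not enough info)."""
    labels = set(gold) | set(pred)
    f1s = []
    for lab in labels:
        tp = sum(g == p == lab for g, p in zip(gold, pred))
        fp = sum(p == lab and g != lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(ndcg_at_10([1, 0, 1, 0, 0]))                              # retrieval quality for one claim
print(macro_f1(["support", "refute"], ["support", "support"]))  # answer quality over claims
```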

Why this matters

  • It helps fight misinformation: HF-RAG gives AI systems a stronger way to check facts by combining the precision of labeled examples and the breadth of general knowledge.
  • It works across topics: Because it balances task-specific patterns with broad knowledge, it handles new domains better (like science or climate facts).
  • It’s practical: The approach runs at inference time (no extra training needed), so it can be plugged into existing RAG systems.
  • It’s a foundation for smarter systems: The authors suggest future work could add reasoning components or multi-agent designs that use search even more intelligently.

In short, HF-RAG is like giving an AI both a well-marked study guide and a trustworthy library, plus several helpful librarians—and teaching it how to combine everything fairly. That makes it better at deciding whether claims are true, false, or uncertain.
