- The paper introduces BridgeRAG, a novel training-free paradigm that decomposes multi-hop retrieval using a bridge-conditioned tripartite scoring function.
- It leverages open-weight embedding models and local LLM inference for effective hop-1 bridge selection and candidate expansion, achieving significant R@5 gains.
- Empirical results highlight the method’s practical impact, with accurate bridge selection being crucial for disambiguating complex reasoning chains in multi-hop QA.
BridgeRAG: A Training-Free Bridge-Conditioned Paradigm for Multi-Hop Retrieval
Multi-hop question answering (MHQA) requires identifying a sequential chain of supporting facts, typically passages, that collectively resolve complex queries involving intermediate entities or latent reasoning steps. Conventional retrieval pipelines, which score passages purely by their similarity to the original query, are fundamentally misaligned with the structural nature of multi-hop reasoning. In particular, the pivotal passage that connects the initial query context to the final target (henceforth, the "bridge" passage) often appears in no surface form within the query, yet conditions the relevance of downstream evidence.
BridgeRAG introduces an explicit decomposition between evidence coverage and chain-aware scoring, replacing the canonical s(q,c) function with a bridge-conditioned tripartite scoring function s(q,b,c), where q is the query, b is the retrieved bridge passage, and c is a candidate passage. This operationalizes the theoretical observation that the conditional utility of a second-hop candidate is a function of both the query and the bridge, not reducible to query-only semantics. The approach obviates the need for graph-structured preprocessing or offline proposition indexing, in contrast to prior systems such as HippoRAG2 (Gutiérrez et al., 20 Feb 2025) and PropRAG [proprag2025].
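The interface change from s(q,c) to s(q,b,c) can be illustrated with a minimal sketch. Note that the paper's scorer is an LLM judge; here it is replaced by toy embedding similarities, and the mixing weight `alpha` is a hypothetical parameter introduced only for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def score_query_only(q_emb, c_emb):
    """Canonical s(q, c): candidate relevance from query similarity alone."""
    return cosine(q_emb, c_emb)

def score_bridge_conditioned(q_emb, b_emb, c_emb, alpha=0.5):
    """Tripartite s(q, b, c): candidate utility also depends on the bridge.
    alpha is an illustrative mixing weight, not taken from the paper."""
    return alpha * cosine(q_emb, c_emb) + (1 - alpha) * cosine(b_emb, c_emb)
```

Under s(q,c), a second-hop passage that is close to the bridge but lexically distant from the query is invisible; under s(q,b,c) its conditional utility surfaces.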
Methodology
BridgeRAG is a training-free, graph-free pipeline that combines expanded candidate coverage with chain-disambiguated reranking, relying exclusively on open-weight embedding models (NV-Embed-v2) and local LLM inference (Llama 3.3 70B). The pipeline is composed of several key components:
- Hop-1 Retrieval and Bridge Selection: The query is embedded, and ANN search surfaces the top-1 bridge passage, b, representing the most immediate entity or context pivot.
- SVO-based Hop-2 Candidate Expansion: LLM-generated Subject-Verb-Object (SVO) forms, conditioned on the query and bridge, are used to create three targeted ANN retrievals, yielding a diverse set of plausible second-hop support candidates.
- Dual-Entity ANN Expansion: Key entities are extracted from the bridge passage and independently used as retrieval seeds, ensuring the candidate pool includes passages structurally or relationally proximal to the bridge.
- Bridge-Conditioned Tripartite Judging: Each candidate passage is scored by an LLM using a joint prompt comprising the query, bridge, extracted entities, and the candidate itself, implementing the s(q,b,c) signal.
- Score Fusion and Reranking: The SVO-based and judge-based scores are normalized (percentile-rank/PIT) and fused, with final top-k selection based on the fused value.
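The five stages above can be sketched end to end. This is a schematic, not the paper's implementation: ANN search is brute-force, the LLM calls (SVO generation, entity extraction, judging) are caller-supplied stubs, and the function names, pool sizes, and 0.5/0.5 fusion weight are illustrative assumptions:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ann_search(query_emb, corpus, k):
    """Brute-force stand-in for ANN retrieval over passage embeddings."""
    return sorted(corpus, key=lambda p: -dot(query_emb, p["emb"]))[:k]

def pit(scores):
    """Percentile-rank (PIT) normalization: map raw scores to (0, 1].
    Tied scores share the rank of their first sorted occurrence."""
    order, n = sorted(scores), len(scores)
    return [(order.index(s) + 1) / n for s in scores]

def bridge_rag(query, corpus, embed, llm, k=5):
    q_emb = embed(query)
    # 1) Hop-1 retrieval: the top-1 passage acts as the bridge b.
    bridge = ann_search(q_emb, corpus, 1)[0]
    pool = {id(bridge): bridge}
    # 2) SVO-based hop-2 expansion: up to three targeted retrievals.
    for svo in llm.generate_svo(query, bridge["text"])[:3]:
        for p in ann_search(embed(svo), corpus, k):
            pool[id(p)] = p
    # 3) Dual-entity expansion seeded from bridge entities.
    for ent in llm.extract_entities(bridge["text"])[:2]:
        for p in ann_search(embed(ent), corpus, k):
            pool[id(p)] = p
    # 4) Bridge-conditioned judging: s(q, b, c) for each candidate.
    candidates = list(pool.values())
    judge_scores = [llm.judge(query, bridge["text"], c["text"]) for c in candidates]
    svo_scores = [dot(q_emb, c["emb"]) for c in candidates]
    # 5) PIT-normalize both signals, fuse, and take the top-k.
    fused = [0.5 * a + 0.5 * b for a, b in zip(pit(svo_scores), pit(judge_scores))]
    ranked = sorted(zip(fused, candidates), key=lambda t: -t[0])
    return [c for _, c in ranked[:k]]
```

Because both signals are reduced to percentile ranks before fusion, the judge and the embedding retriever contribute on a common scale regardless of their raw score distributions.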
Notably, the methodology explicitly separates expansion (pool coverage) from reranking (scoring), permitting robust ablation of each subcomponent and enabling interpretable examination of selective gains from bridge conditioning.
Empirical Results
BridgeRAG establishes new state-of-the-art R@5 performance for training-free, non-graph MHQA retrieval:
- MuSiQue: 0.8146 (+3.1pp vs. PropRAG, +6.8pp vs. HippoRAG2)
- 2WikiMultiHopQA: 0.9527 (+1.2pp vs. PropRAG)
- HotpotQA: 0.9875 (+1.35pp vs. PropRAG)
Bridge conditioning yields strong, selective effects on parallel-chain (bridge-comparison) queries, contributing a statistically significant +2.55pp R@5 (p<0.001), whereas gains on single-chain subtypes are negligible. Substituting the true bridge with either generated queries or semantically distant passages degrades performance, demonstrating that the actual bridge content, rather than generic context expansion, drives the improvement.
Error analysis shows that the dominant limiting factors are bridge selection failures (38%) and candidate-pool misses (31%), rather than deficiencies in the chain-aware judge or overfitting to specific benchmarks or hyperparameters.
Theoretical and Practical Implications
BridgeRAG's tripartite scoring function is directly motivated by an information-theoretic framework: the entropy of reachable gold passages given only the query is high for bridge-comparison instances, and mutual information with the bridge is both necessary and sufficient to resolve the active reasoning chain. Empirical analyses (e.g., the correlation between bridge proximity and observed gain, and the effect of bridge-substitution interventions on re-ranking) substantiate this formal intuition.
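One way to write this intuition compactly (a sketch in our own notation; the symbols G, Q, B are not necessarily the paper's):

```latex
% Let Q be the query, B the bridge passage, and G the gold second-hop passage.
% Bridge-comparison instances leave many chains consistent with Q alone:
H(G \mid Q) \gg 0
% Conditioning on the bridge resolves the active chain; the conditional
% mutual information it contributes is large:
I(G; B \mid Q) \;=\; H(G \mid Q) - H(G \mid Q, B) \;>\; 0
% A scorer s(q, b, c) can exploit I(G; B | Q); a query-only scorer s(q, c)
% cannot, since it marginalizes over B.
```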
The practical consequence is that training-free MHQA can reach or exceed systems with complex offline graph construction by leveraging LLM-based, bridge-conditioned discriminators, without introducing any new corpus-level preprocessing dependencies. BridgeRAG's architecture thus enables immediate deployment on new, unseen corpora with only passage embedding preprocessing.
Limitations and Future Directions
Despite outperforming prior work, BridgeRAG has several inherent limitations. Its efficacy depends on robust identification of the bridge: errors at hop 1 propagate directly into scoring. The tripartite judge processes up to 20 candidate passages jointly, creating context-length and latency constraints, with the LLM judge dominating end-to-end latency (4–5 seconds per query on high-end GPU hardware). The approach's benefit is pronounced only in high-ambiguity, chain-disambiguation settings; for single-chain or directly inferable queries, the conditional utility signal is effectively redundant. Future research could generalize the conditioning context to sets of bridge passages, extend the method beyond two hops, or reintroduce partial structural inductive bias without explicit graph construction.
Conclusion
BridgeRAG provides a principled, fully training-free mechanism for chain-sensitive multi-hop retrieval, establishing that explicit bridge-conditioned reranking suffices to disambiguate reasoning paths in MHQA. Its conditioning signal is selective, mechanistically precise, and not substitutable by generic context, yielding robust improvements over graph-based and proposition-based alternatives under equal computational and model constraints. The method reframes the retrieval target for multi-hop QA: relevance is not a query-only property but a conditional utility modulated by intermediate entities, operationalized efficiently with open-weight LLMs (2604.03384).