N2N-GQA: Graph-Based Hybrid QA
- The paper introduces a zero-shot open-domain QA system that transforms noisy retrieval outputs into dynamic evidence graphs for robust multi-hop reasoning.
- It employs GraphRank to combine semantic relevance with weighted degree centrality, achieving up to a 19.9-point EM boost on OTT-QA.
- The framework integrates structured query planning and iterative evidence selection to effectively bridge unstructured texts and tabular data.
N2N-GQA (“Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs”) is a zero-shot, open-domain multi-hop QA framework designed for hybrid corpora containing both unstructured text and tabular data. It operates by dynamically constructing query-dependent evidence graphs from noisy retrieval outputs and converting these structures into coherent reasoning narratives for LLMs. N2N-GQA prioritizes connectivity-aware evidence selection, introducing crucial graph-based mechanisms for multi-hop question answering—where reasoning requires bridging across intermediate facts and entities—while operating without task-specific training or fine-tuning.
1. Problem Setting and Motivational Context
N2N-GQA addresses open-domain, multi-hop QA over hybrid corpora $C = P \cup T$, where $P$ is a set of text passages and $T$ a set of serialized table rows. Standard Retrieval-Augmented Generation (RAG) approaches process candidate evidence as flat ranked lists, so retrieval noise disrupts reasoning chains, particularly for multi-hop questions that require chaining facts across document boundaries. N2N-GQA’s principal insight is that encoding retrieved items as graph nodes, with semantic relationships (edge weights via TF-IDF overlap) between them, enables the identification of bridge documents that mediate reasoning steps, a capability absent from traditional list-based retrieval. The framework is strictly zero-shot and does not rely on supervised signals from the target QA data.
The core contributions are as follows:
- The first zero-shot open-domain QA system for hybrid corpora that constructs dynamic evidence graphs from noisy retrieval results.
- A noise-to-narrative mechanism involving graph construction, centrality-based pruning of bridge nodes, and narrative synthesis.
- Introduction of GraphRank—a lightweight node ranking function integrating semantic retrieval scores with graph centrality.
- Empirical demonstration of a 19.9-point EM gain on OTT-QA over standard RAG baselines, with performance matching fine-tuned retrieval systems without any training (Sharafath et al., 10 Jan 2026).
2. Evidence Graph Construction and Scoring
Given an input question $q$, a ColBERTv2 semantic retriever assembles a candidate set $D = \{d_1, \dots, d_k\}$, each $d_i$ representing either a passage or a table row. N2N-GQA builds an undirected evidence graph $G = (V, E)$ over $D$:
- Nodes ($V$): Each node $v_i$ corresponds to a retrieved document $d_i$.
- Edges ($E$): An edge $(v_i, v_j)$ exists if $d_i$ and $d_j$ share at least one term; its weight is the sum of TF-IDF scores over all shared terms: $w_{ij} = \sum_{t \in d_i \cap d_j} \big(\mathrm{tfidf}(t, d_i) + \mathrm{tfidf}(t, d_j)\big)$.
- Semantic Labeling: Edges are not typed; they represent lexical, entity, or phrase overlap.
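The construction above can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the paper's code: the whitespace tokenizer, the particular TF-IDF variant, and the helper names `tfidf_weights` and `build_evidence_graph` are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_weights(docs):
    """Per-document TF-IDF over whitespace tokens (illustrative variant)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # document frequency per term
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: (c / len(toks)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

def build_evidence_graph(docs):
    """Undirected graph: nodes are retrieved docs; an edge links two docs
    sharing a term, weighted by the summed TF-IDF of the shared terms."""
    w = tfidf_weights(docs)
    edges = {}
    for i, j in combinations(range(len(docs)), 2):
        shared = set(w[i]) & set(w[j])
        if shared:
            edges[(i, j)] = sum(w[i][t] + w[j][t] for t in shared)
    return edges
```

Documents connected only through very common terms receive near-zero edge weights, which is what lets the later centrality step favor genuine bridge documents.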
GraphRank Function
Each node $v \in V$ is assigned:
- $s_{\mathrm{sem}}(v)$: the retrieval (ColBERTv2) score.
- $s_{\mathrm{cen}}(v)$: weighted degree centrality in $G$, $s_{\mathrm{cen}}(v) = \sum_{u \in N(v)} w_{vu}$.
Both scores are min–max normalized to $[0, 1]$, yielding $\hat{s}_{\mathrm{sem}}(v)$ and $\hat{s}_{\mathrm{cen}}(v)$. The final node score (Equation 1) is:
$$\mathrm{GraphRank}(v) = \hat{s}_{\mathrm{sem}}(v) \cdot \big(1 + \alpha \cdot \hat{s}_{\mathrm{cen}}(v)\big)$$
with $\alpha > 0$; semantic relevance gates the structural boost such that only high-relevance nodes benefit from graph centrality.
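A minimal sketch of this scoring, assuming min–max normalization and a fixed illustrative $\alpha$ (the paper's exact normalization and $\alpha$ value are not reproduced here):

```python
def graphrank(sem_scores, edges, alpha=0.5):
    """GraphRank sketch: normalized semantic score gates a centrality boost.
    sem_scores: retrieval score per node; edges: {(i, j): weight}.
    alpha=0.5 is an illustrative choice, not the paper's value."""
    n = len(sem_scores)
    cen = [0.0] * n
    for (i, j), w in edges.items():   # weighted degree centrality
        cen[i] += w
        cen[j] += w

    def norm(xs):                     # min-max normalize to [0, 1]
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    s_hat, c_hat = norm(sem_scores), norm(cen)
    # multiplicative gating: only high-relevance nodes gain from centrality
    return [s * (1 + alpha * c) for s, c in zip(s_hat, c_hat)]
```

Because the boost is multiplicative, a highly central but semantically irrelevant node (normalized semantic score near 0) still scores near 0, which is the "gating" behavior described above.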
3. End-to-End Pipeline Architecture
N2N-GQA’s operation is structured in four sequential stages:
- Structured Query Planning: An LLM is prompted to decompose $q$ into a sequence of sub-queries, each classified as 1-, 2-, or 3-hop and templated with expected entity types. This yields a stepwise reasoning plan.
- Iterative Evidence Gathering & Entity Extraction: For each hop $h$,
- Retrieve the top-$k$ candidates via ColBERTv2.
- Construct a local graph with TF-IDF edges.
- Re-rank and prune nodes via GraphRank, selecting the top 5–10 for LLM-driven entity extraction ($\mathcal{E}_h$).
- Use $\mathcal{E}_h$ to instantiate the next sub-query.
- Global Evidence Pool & Bridge-Aware Selection: Merge the retrieved documents from all hops into a unified pool. Partition the pool into passages and table rows, then select the top-$k_p$ passages and top-$k_t$ rows by semantic score. The linking function checks overlap between table cell values and passage terms, modulating scores via a priority boost $\beta$. Algorithm 1 details this bridge-aware selection.
- Final Graph and Answer Synthesis: Over the top 50 global candidates, a final evidence graph is built and pruned (GraphRank to 12–25 nodes, guaranteeing that both passages and tables are represented), forming a narrative chain that, together with the structured reasoning plan and the original question $q$, is submitted to the LLM for answer synthesis.
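The bridge-aware selection step can be approximated as follows. This is a simplified sketch, not Algorithm 1 itself: the function name `bridge_aware_select`, its parameters, and the whitespace-based cell/term overlap check are all assumptions standing in for the paper's linking function and priority boost.

```python
def bridge_aware_select(passages, tables, k_p=3, k_t=2, boost=1.5):
    """Sketch of bridge-aware hybrid selection.
    passages, tables: lists of (text, semantic_score).
    A table row whose cell values overlap the selected passages' terms
    gets its score multiplied by `boost` (an illustrative value)."""
    top_p = sorted(passages, key=lambda x: -x[1])[:k_p]

    passage_terms = set()
    for text, _ in top_p:
        passage_terms.update(text.lower().split())

    rescored = []
    for text, score in tables:
        cells = set(text.lower().split())       # serialized row -> "cells"
        linked = bool(cells & passage_terms)    # linking function (simplified)
        rescored.append((text, score * boost if linked else score))

    top_t = sorted(rescored, key=lambda x: -x[1])[:k_t]
    return top_p, top_t
```

The effect is that a table row with a lower raw retrieval score can still outrank an unlinked one, which is exactly the bridging behavior the global pool stage is meant to encourage.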
4. Algorithmic Complexity and Implementation Details
- Graph Construction: Per hop, $O(k^2)$ pairwise TF-IDF-overlap comparisons; the small candidate budget $k$ keeps this tractable.
- Centrality: Weighted degree is $O(|E|)$ per graph.
- GraphRank: Node-wise normalization and scoring is $O(|V|)$.
- Bridge Selector: Linear in the size of the global pool.
- Pipeline Cost: Typically $H$ retrieval calls and $H$ graph builds for $H$ hops ($H \le 3$). LLM inference constitutes the main computational bottleneck.
5. Empirical Evaluation and Results
Datasets and Models
- HybridQA dev (200 questions, retrieval mode)
- OTT-QA dev (500 questions, sampled)
- Reader LLMs: GPT-4o, GPT-4.1, Llama3-70B
- Metrics: Exact Match (EM), token-level F1, Precision, Recall, BERTScore-F1
Ablation Study (OTT-QA)
| Method | EM | F1 | P | R | BERTScore-F1 |
|---|---|---|---|---|---|
| Vanilla RAG | 28.60 | 40.82 | 31.16 | 38.39 | 34.57 |
| + Query Decomposition | 28.60 | 40.82 | 31.16 | 38.39 | 34.57 |
| N2N-GQA w/o GraphRank | 48.50 | 56.90 | 58.61 | 58.31 | 58.22 |
| N2N-GQA w/ GraphRank | 48.80 | 57.26 | 59.78 | 58.28 | 58.76 |
- On OTT-QA, query decomposition alone leaves all metrics unchanged relative to Vanilla RAG (contrast its substantial effect on HybridQA below).
- Graph-based curation (without GraphRank) delivers the decisive jump: +19.9 EM (28.60 → 48.50).
- GraphRank contributes a further modest +0.3 EM.
Comparison with Fine-Tuned SOTA (OTT-QA)
| Method | EM | F1 |
|---|---|---|
| COS | 56.9 | 63.2 |
| CORE | 49.0 | 55.7 |
| N2N-GQA† | 48.80 | 57.26 |

† zero-shot (COS and CORE are fine-tuned). Performance effectively matches CORE (49.0 EM) and approaches COS (56.9 EM) without any training.
HybridQA Results
| Method | EM | F1 |
|---|---|---|
| Vanilla RAG | 9.50 | 14.05 |
| + Query Decomp | 22.00 | 31.06 |
| N2N-GQA w/o GraphRank | 41.00 | 48.38 |
| N2N-GQA w/ GraphRank | 41.50 | 48.17 |
6. Ablation, Interpretability, and Error Analysis
- The most critical improvement comes from graph curation—transforming flat lists into connectivity-aware graphs boosts EM substantially.
- GraphRank identifies bridge nodes, contributing 0.3–1.0 EM depending on the dataset.
- The Bridge-Aware Hybrid Selector marginally improves the integration of passage and table evidence.
- Error analysis identifies failure modes: evidence not retrieved, intermediate entity ambiguity, limitations inherent in table serialization.
7. Limitations, Future Directions, and Applications
- Computational Demands: The heavy reliance on large LLM inference is costly. Mitigations may include knowledge distillation or more efficient query planning.
- Graph Simplicity: The TF-IDF-based edge scheme overlooks richer semantic or relational information. Extension to embedding-based or external knowledge graph links is suggested as future work.
- GraphRank Impact: The incremental gains from centrality suggest that other graph metrics (e.g., PageRank, betweenness) or an adaptive $\alpha$ may warrant investigation.
- Generalization of Paradigm: The noise-to-narrative mechanism is extensible to related tasks—fact verification, multi-document summarization, or knowledge-base construction—requiring reasoning over hybrid and noisy evidence sources.
N2N-GQA demonstrates the efficacy of dynamic evidence graph construction and principled graph-based pruning, achieving robust and interpretable zero-shot multi-hop reasoning across hybrid table-text corpora. Connectivity among evidence, rather than per-document relevance alone, is decisive for assembling effective reasoning chains; simple, transparent graph algorithms are shown to rival complex fine-tuned pipelines (Sharafath et al., 10 Jan 2026).