N2N-GQA: Graph-Based Hybrid QA

Updated 17 January 2026

The paper introduces a zero-shot open-domain QA system that transforms noisy retrieval outputs into dynamic evidence graphs for robust multi-hop reasoning.
It employs GraphRank to combine semantic relevance with weighted degree centrality, achieving up to a 19.9-point EM boost on OTT-QA.
The framework integrates structured query planning and iterative evidence selection to effectively bridge unstructured texts and tabular data.

N2N-GQA (“Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs”) is a zero-shot, open-domain multi-hop QA framework designed for hybrid corpora containing both unstructured text and tabular data. It operates by dynamically constructing query-dependent evidence graphs from noisy retrieval outputs and converting these structures into coherent reasoning narratives for LLMs. N2N-GQA prioritizes connectivity-aware evidence selection, introducing crucial graph-based mechanisms for multi-hop question answering—where reasoning requires bridging across intermediate facts and entities—while operating without task-specific training or fine-tuning.

1. Problem Setting and Motivational Context

N2N-GQA addresses open-domain, multi-hop QA over hybrid corpora $C = P \cup T$ , where $P$ is a set of text passages and $T$ a set of serialized table rows. Standard Retrieval-Augmented Generation (RAG) approaches process candidate evidence as flat ranked lists, resulting in retrieval noise that impairs reasoning chains, particularly for multi-hop questions requiring chaining facts across document boundaries. N2N-GQA’s principal insight is that encoding retrieved items as graph nodes, with semantic relationships (edge weights via TF-IDF overlap) between them, enables the identification of bridge documents that mediate reasoning steps—a capability not present in traditional list-based retrieval. The framework is strictly zero-shot and does not rely on supervised signals from the target QA data.

The core contributions are as follows:

The first zero-shot open-domain QA system for hybrid corpora that constructs dynamic evidence graphs from noisy retrieval results.
A noise-to-narrative mechanism involving graph construction, centrality-based pruning of bridge nodes, and narrative synthesis.
Introduction of GraphRank—a lightweight node ranking function integrating semantic retrieval scores with graph centrality.
Empirical demonstration of a 19.9-point EM gain on OTT-QA over standard RAG baselines, with performance matching finetuned retrieval systems absent any training (Sharafath et al., 10 Jan 2026).

2. Evidence Graph Construction and Scoring

Given input question $Q$ , a ColBERTv2 semantic retriever is used to assemble a candidate set $D = \{d_1, ..., d_k\}$ , each $d_i$ representing either a passage or table row. N2N-GQA builds an undirected evidence graph $G = (V, E)$ over $D$ :

Nodes ( $V$ ): Each node $v$ corresponds to a retrieved document $d$ .
Edges ( $E$ ): An edge $(u, v)$ exists if $u$ and $v$ share at least one term; its weight is the sum of TF-IDF scores over all shared terms:

$w(u,v) = \sum_{t \in tokens(u) \cap tokens(v)} \text{TF-IDF}(t)$

Semantic Labeling: Edges are not typed, but represent lexical, entity, or phrase overlap.

GraphRank Function

Each node $v$ is assigned:

$S_{sem}(v)$ : the retrieval (ColBERTv2) score.
$S_{struct}(v)$ : weighted degree centrality in $G$ :

$S_{struct}(v) = \sum_{u \in V} w(u,v)$

Both scores are normalized to $[0, 1]$ , yielding $S_{sem\_norm}(v)$ and $S_{struct\_norm}(v)$ . The final node score (Equation 1) is:

$\text{Score}_{GR}(v) = S_{sem\_norm}(v) \times (1 + (1-\alpha) S_{struct\_norm}(v))$

with $\alpha = 0.85$ ; semantic relevance gates the structural boost such that only high-relevance nodes benefit from graph centrality.

3. End-to-End Pipeline Architecture

N2N-GQA’s operation is structured in four sequential stages:

Structured Query Planning: An LLM is prompted to decompose $Q$ into a sequence of sub-queries, each classified as 1-, 2-, or 3-hop and templated with expected entity types. This yields a stepwise reasoning plan.
Iterative Evidence Gathering & Entity Extraction: For each hop $i$ $i$ ,
- Retrieve the top- $k$ candidates via ColBERTv2.
- Construct a local graph $G_i$ with TF-IDF edges.
- Re-rank and prune $G_i$ nodes via GraphRank, selecting the top 5–10 for LLM-driven entity extraction ( $e_i$ ).
- Use $e_i$ to instantiate the next sub-query.
Global Evidence Pool & Bridge-Aware Selection: Merge retrieved docs from all hops into a unified pool. Partition into passages $\mathcal{P}$ and tables $\mathcal{T}$ , then select top items $p_{top}$ and $t_{top}$ by semantic score. The linking function $\varphi(p, t)$ checks overlap between table cell values and passage terms, modulating scores via a priority boost $\beta$ . Algorithm 1 details this bridge-aware selection.
Final Graph and Answer Synthesis: Over the top $\sim$ 50 global candidates, a final evidence graph $G^*$ is built and pruned (GraphRank to 12–25 nodes, guaranteeing $\geq 2$ passages and tables), forming a narrative chain that, along with the structured reasoning plan and original $Q$ , is submitted to the LLM for answer synthesis.

4. Algorithmic Complexity and Implementation Details

Graph Construction: Per hop, complexity is $O(k^2)$ for TF-IDF overlap; $k \approx 20$ keeps this tractable.
Centrality: Weighted degree is $O(|E|)$ per graph.
GraphRank: Node-wise normalization and scoring is $O(|V|)$ .
Bridge Selector: Linear in $|\mathcal{P}| + |\mathcal{T}|$ per global pool.
Pipeline Cost: Typically $O(H)$ retrieval calls and $O(H+1)$ graph builds for $H$ hops ( $H \leq 2$ ). LLM inference constitutes the main computational bottleneck.

5. Empirical Evaluation and Results

Datasets and Models

HybridQA dev (200 questions, retrieval mode)
OTT-QA dev (500 questions, sampled)
Reader LLMs: GPT-4o, GPT-4.1, Llama3-70B
Metrics: Exact Match (EM), token-level F1, Precision, Recall, BERTScore-F1

Ablation Study (OTT-QA)

Method	EM	F1	P	R	Bert-F1
Vanilla RAG	28.60	40.82	31.16	38.39	34.57
+ Query Decomposition	28.60	40.82	31.16	38.39	34.57
N2N-GQA w/o GraphRank	48.50	56.90	58.61	58.31	58.22
N2N-GQA w/ GraphRank	48.80	57.26	59.78	58.28	58.76

Query decomposition delivers a $\sim$ 20 EM increase over Vanilla RAG.
Graph-based curation (without GraphRank) yields a 19.9-point EM jump.
GraphRank contributes a further modest +0.3 EM.

Comparison with Fine-Tuned SOTA (OTT-QA)

Method	EM	F1
COS	56.9	63.2
CORE	49.0	55.7
N2N-GQA†	48.80	57.26

Performance matches CORE (49.0 EM) and approaches COS (56.9 EM) while remaining zero-shot.

HybridQA Results

Method	EM	F1
Vanilla RAG	9.50	14.05
+ Query Decomp	22.00	31.06
N2N-GQA w/o GraphRank	41.00	48.38
N2N-GQA w/ GraphRank	41.50	48.17

6. Ablation, Interpretability, and Error Analysis

The most critical improvement comes from graph curation—transforming flat lists into connectivity-aware graphs boosts EM substantially.
GraphRank reliably identifies bridge nodes, contributing $\sim$ 0.3–1.0 EM.
The Bridge-Aware Hybrid Selector marginally improves the integration of passage and table evidence.
Error analysis identifies failure modes: evidence not retrieved, intermediate entity ambiguity, limitations inherent in table serialization.

7. Limitations, Future Directions, and Applications

Computational Demands: The heavy reliance on large LLM inference is costly. Mitigations may include knowledge distillation or more efficient query planning.
Graph Simplicity: The TF-IDF-based edge scheme overlooks richer semantic or relational information. Extension to embedding-based or external knowledge graph links is suggested as future work.
GraphRank Impact: The incremental gains from centrality imply other graph metrics (e.g., PageRank, betweenness) or adaptive $\alpha$ may warrant investigation.
Generalization of Paradigm: The noise-to-narrative mechanism is extensible to related tasks—fact verification, multi-document summarization, or knowledge-base construction—requiring reasoning over hybrid and noisy evidence sources.

N2N-GQA demonstrates the efficacy of dynamic evidence graph construction and principled graph-based pruning, achieving robust and interpretable zero-shot multi-hop reasoning across hybrid table-text corpora. Connectivity among evidence, rather than per-document relevance alone, is decisive for assembling effective reasoning chains; simple, transparent graph algorithms are shown to rival complex fine-tuned pipelines (Sharafath et al., 10 Jan 2026).

Markdown Report Issue Upgrade to Chat

References (1)

N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to N2N-GQA.