
N2N-GQA: Graph-Based Hybrid QA

Updated 17 January 2026
  • The paper introduces a zero-shot open-domain QA system that transforms noisy retrieval outputs into dynamic evidence graphs for robust multi-hop reasoning.
  • It employs GraphRank to combine semantic relevance with weighted degree centrality, achieving up to a 19.9-point EM boost on OTT-QA.
  • The framework integrates structured query planning and iterative evidence selection to effectively bridge unstructured texts and tabular data.

N2N-GQA (“Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs”) is a zero-shot, open-domain multi-hop QA framework designed for hybrid corpora containing both unstructured text and tabular data. It operates by dynamically constructing query-dependent evidence graphs from noisy retrieval outputs and converting these structures into coherent reasoning narratives for LLMs. N2N-GQA prioritizes connectivity-aware evidence selection, introducing crucial graph-based mechanisms for multi-hop question answering—where reasoning requires bridging across intermediate facts and entities—while operating without task-specific training or fine-tuning.

1. Problem Setting and Motivational Context

N2N-GQA addresses open-domain, multi-hop QA over hybrid corpora $C = P \cup T$, where $P$ is a set of text passages and $T$ a set of serialized table rows. Standard Retrieval-Augmented Generation (RAG) approaches process candidate evidence as flat ranked lists, so retrieval noise impairs reasoning chains, particularly for multi-hop questions requiring chaining facts across document boundaries. N2N-GQA's principal insight is that encoding retrieved items as graph nodes, with semantic relationships (edge weights via TF-IDF overlap) between them, enables the identification of bridge documents that mediate reasoning steps, a capability not present in traditional list-based retrieval. The framework is strictly zero-shot and does not rely on supervised signals from the target QA data.

The core contributions are as follows:

  • The first zero-shot open-domain QA system for hybrid corpora that constructs dynamic evidence graphs from noisy retrieval results.
  • A noise-to-narrative mechanism involving graph construction, centrality-based pruning of bridge nodes, and narrative synthesis.
  • Introduction of GraphRank—a lightweight node ranking function integrating semantic retrieval scores with graph centrality.
  • Empirical demonstration of a 19.9-point EM gain on OTT-QA over standard RAG baselines, with performance matching fine-tuned retrieval systems without any training (Sharafath et al., 10 Jan 2026).

2. Evidence Graph Construction and Scoring

Given an input question $Q$, a ColBERTv2 semantic retriever is used to assemble a candidate set $D = \{d_1, ..., d_k\}$, where each $d_i$ represents either a passage or a table row. N2N-GQA builds an undirected evidence graph $G = (V, E)$ over $D$:

  • Nodes ($V$): Each node $v$ corresponds to a retrieved document $d$.
  • Edges ($E$): An edge $(u, v)$ exists if $u$ and $v$ share at least one term; its weight is the sum of TF-IDF scores over all shared terms:

$$w(u,v) = \sum_{t \in \text{tokens}(u) \cap \text{tokens}(v)} \text{TF-IDF}(t)$$

  • Semantic Labeling: Edges are not typed, but represent lexical, entity, or phrase overlap.
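A minimal sketch of this construction, assuming whitespace tokenization and IDF computed over the candidate pool itself (the paper does not pin down these implementation details):

```python
import math
from collections import Counter
from itertools import combinations

def build_evidence_graph(docs):
    """Build the edge set of an undirected evidence graph over retrieved docs.

    Nodes are document indices; an edge (u, v) exists when the two documents
    share at least one term, weighted by the summed TF-IDF of shared terms.
    Here TF is treated as binary and IDF is computed over the candidate pool
    itself -- both simplifying assumptions.
    """
    token_sets = [set(d.lower().split()) for d in docs]
    n = len(docs)
    # Document frequency over the candidate pool.
    df = Counter(t for toks in token_sets for t in toks)
    idf = {t: math.log(n / df[t]) for t in df}

    edges = {}
    for u, v in combinations(range(n), 2):
        shared = token_sets[u] & token_sets[v]
        if shared:  # edge exists iff at least one term is shared
            edges[(u, v)] = sum(idf[t] for t in shared)
    return edges
```

With $k \approx 20$ candidates per hop, the pairwise loop stays cheap, which is where the $O(k^2)$ construction cost in §4 comes from.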

GraphRank Function

Each node $v$ is assigned:

  • $S_{sem}(v)$: the retrieval (ColBERTv2) score.
  • $S_{struct}(v)$: weighted degree centrality in $G$:

$$S_{struct}(v) = \sum_{u \in V} w(u,v)$$

Both scores are normalized to $[0, 1]$, yielding $S_{sem\_norm}(v)$ and $S_{struct\_norm}(v)$. The final node score (Equation 1) is:

$$\text{Score}_{GR}(v) = S_{sem\_norm}(v) \times \bigl(1 + (1-\alpha)\, S_{struct\_norm}(v)\bigr)$$

with $\alpha = 0.85$; semantic relevance gates the structural boost such that only high-relevance nodes benefit from graph centrality.
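Equation 1 can be sketched as follows; min–max normalization is an assumption here, since the paper states only that both scores are mapped to $[0, 1]$:

```python
def graphrank(sem_scores, edges, alpha=0.85):
    """Score nodes per Equation 1: semantic relevance gates a structural
    boost derived from weighted degree centrality.

    sem_scores: list of retrieval scores, one per node index.
    edges: dict mapping (u, v) node-index pairs to TF-IDF edge weights.
    """
    n = len(sem_scores)
    # Weighted degree centrality: sum of incident edge weights.
    struct = [0.0] * n
    for (u, v), w in edges.items():
        struct[u] += w
        struct[v] += w

    def minmax(xs):  # normalize to [0, 1] (assumed normalization scheme)
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    sem_n, struct_n = minmax(sem_scores), minmax(struct)
    return [s * (1 + (1 - alpha) * c) for s, c in zip(sem_n, struct_n)]
```

Because the boost is multiplicative, a node with near-zero semantic relevance gains almost nothing from high centrality, which is the gating behavior described above.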

3. End-to-End Pipeline Architecture

N2N-GQA’s operation is structured in four sequential stages:

  1. Structured Query Planning: An LLM is prompted to decompose $Q$ into a sequence of sub-queries, each classified as 1-, 2-, or 3-hop and templated with expected entity types. This yields a stepwise reasoning plan.
  2. Iterative Evidence Gathering & Entity Extraction: For each hop $i$,
    • Retrieve the top-$k$ candidates via ColBERTv2.
    • Construct a local graph $G_i$ with TF-IDF edges.
    • Re-rank and prune the nodes of $G_i$ via GraphRank, selecting the top 5–10 for LLM-driven entity extraction ($e_i$).
    • Use $e_i$ to instantiate the next sub-query.
  3. Global Evidence Pool & Bridge-Aware Selection: Merge the retrieved documents from all hops into a unified pool. Partition it into passages $\mathcal{P}$ and tables $\mathcal{T}$, then select the top items $p_{top}$ and $t_{top}$ by semantic score. The linking function $\varphi(p, t)$ checks overlap between table cell values and passage terms, modulating scores via a priority boost $\beta$. Algorithm 1 details this bridge-aware selection.
  4. Final Graph and Answer Synthesis: Over the top ~50 global candidates, a final evidence graph $G^*$ is built and pruned (GraphRank down to 12–25 nodes, guaranteeing $\geq 2$ passages and tables), forming a narrative chain that, together with the structured reasoning plan and the original question $Q$, is submitted to the LLM for answer synthesis.
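Stage 3's bridge-aware selection might look like the following sketch. Algorithm 1 itself is not reproduced in this summary, so the lexical linking function $\varphi$ and the multiplicative application of the boost $\beta$ (with a placeholder value) are assumptions:

```python
def bridge_aware_select(passages, tables, beta=0.2, top_p=5, top_t=5):
    """Select top passages and tables by semantic score, boosting items
    that are lexically linked across the passage/table partition.

    passages, tables: lists of (text, semantic_score) pairs.
    beta: hypothetical priority boost applied when a link is found.
    """
    def phi(p_text, t_text):
        # Assumed linking function: any table token appearing in the passage.
        return bool(set(t_text.lower().split()) & set(p_text.lower().split()))

    boosted_p = []
    for p_text, p_score in passages:
        linked = any(phi(p_text, t_text) for t_text, _ in tables)
        boosted_p.append((p_text, p_score * (1 + beta) if linked else p_score))

    boosted_t = []
    for t_text, t_score in tables:
        linked = any(phi(p_text, t_text) for p_text, _ in passages)
        boosted_t.append((t_text, t_score * (1 + beta) if linked else t_score))

    top_pass = sorted(boosted_p, key=lambda x: -x[1])[:top_p]
    top_tab = sorted(boosted_t, key=lambda x: -x[1])[:top_t]
    return top_pass, top_tab
```

The intent is that a passage mentioning a table's cell values (or vice versa) outranks an equally relevant but unlinked item, favoring evidence pairs that can actually bridge text and table reasoning.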

4. Algorithmic Complexity and Implementation Details

  • Graph Construction: Per hop, complexity is $O(k^2)$ for pairwise TF-IDF overlap; $k \approx 20$ keeps this tractable.
  • Centrality: Weighted degree is $O(|E|)$ per graph.
  • GraphRank: Node-wise normalization and scoring is $O(|V|)$.
  • Bridge Selector: Linear in $|\mathcal{P}| + |\mathcal{T}|$ per global pool.
  • Pipeline Cost: Typically $O(H)$ retrieval calls and $O(H+1)$ graph builds for $H$ hops ($H \leq 2$). LLM inference constitutes the main computational bottleneck.

5. Empirical Evaluation and Results

Datasets and Models

  • HybridQA dev (200 questions, retrieval mode)
  • OTT-QA dev (500 questions, sampled)
  • Reader LLMs: GPT-4o, GPT-4.1, Llama3-70B
  • Metrics: Exact Match (EM), token-level F1, Precision, Recall, BERTScore-F1

Ablation Study (OTT-QA)

| Method | EM | F1 | P | R | BERT-F1 |
|---|---|---|---|---|---|
| Vanilla RAG | 28.60 | 40.82 | 31.16 | 38.39 | 34.57 |
| + Query Decomposition | 28.60 | 40.82 | 31.16 | 38.39 | 34.57 |
| N2N-GQA w/o GraphRank | 48.50 | 56.90 | 58.61 | 58.31 | 58.22 |
| N2N-GQA w/ GraphRank | 48.80 | 57.26 | 59.78 | 58.28 | 58.76 |
  • In this OTT-QA ablation, query decomposition alone does not improve over Vanilla RAG; the decisive gain comes from graph-based curation.
  • Graph-based curation (without GraphRank) yields a 19.9-point EM jump.
  • GraphRank contributes a further modest +0.3 EM.

Comparison with Fine-Tuned SOTA (OTT-QA)

| Method | EM | F1 |
|---|---|---|
| COS | 56.9 | 63.2 |
| CORE | 49.0 | 55.7 |
| N2N-GQA† | 48.80 | 57.26 |

N2N-GQA is on par with fine-tuned CORE (49.0 EM) and approaches COS (56.9 EM) while remaining zero-shot.

HybridQA Results

| Method | EM | F1 |
|---|---|---|
| Vanilla RAG | 9.50 | 14.05 |
| + Query Decomp | 22.00 | 31.06 |
| N2N-GQA w/o GraphRank | 41.00 | 48.38 |
| N2N-GQA w/ GraphRank | 41.50 | 48.17 |

6. Ablation, Interpretability, and Error Analysis

  • The most critical improvement comes from graph curation—transforming flat lists into connectivity-aware graphs boosts EM substantially.
  • GraphRank reliably identifies bridge nodes, contributing ~0.3–1.0 EM.
  • The Bridge-Aware Hybrid Selector marginally improves the integration of passage and table evidence.
  • Error analysis identifies three failure modes: evidence missed at retrieval time, ambiguous intermediate entities, and limitations inherent in table serialization.

7. Limitations, Future Directions, and Applications

  • Computational Demands: The heavy reliance on large LLM inference is costly. Mitigations may include knowledge distillation or more efficient query planning.
  • Graph Simplicity: The TF-IDF-based edge scheme overlooks richer semantic or relational information. Extension to embedding-based or external knowledge graph links is suggested as future work.
  • GraphRank Impact: The incremental gains from centrality imply other graph metrics (e.g., PageRank, betweenness) or an adaptive $\alpha$ may warrant investigation.
  • Generalization of Paradigm: The noise-to-narrative mechanism is extensible to related tasks—fact verification, multi-document summarization, or knowledge-base construction—requiring reasoning over hybrid and noisy evidence sources.

N2N-GQA demonstrates the efficacy of dynamic evidence graph construction and principled graph-based pruning, achieving robust and interpretable zero-shot multi-hop reasoning across hybrid table-text corpora. Connectivity among evidence, rather than per-document relevance alone, is decisive for assembling effective reasoning chains; simple, transparent graph algorithms are shown to rival complex fine-tuned pipelines (Sharafath et al., 10 Jan 2026).

