NovelHopQA: Multi-Hop Long-Context QA Benchmark

Updated 4 January 2026
  • The paper introduces a diagnostic benchmark that systematically tests multi-hop reasoning in long-context narratives extracted from public-domain novels.
  • Empirical results show LLM performance degradation with increased hop depth and context length, highlighting issues like entity confusion and incomplete evidence integration.
  • Controlled chain construction and novel evaluation protocols in NovelHopQA provide actionable insights for developing improved retrieval-augmented, hop-aware QA methodologies.

NovelHopQA is a diagnostic benchmark designed to evaluate and analyze multi-hop reasoning capabilities in long-context question answering (QA) systems, specifically within full-length narrative contexts spanning 64,000–128,000 tokens. Unlike prior benchmarks that separately assess long-context comprehension or multi-hop reasoning in synthetic or short Wikipedia snippets, NovelHopQA systematically varies both context length and reasoning hop depth (1–4 hops) within authentic, contiguous storylines. Its construction, evaluation, and subsequent experimental analysis offer controlled insights into the limitations of current LLMs, highlighting characteristic failure modes and informing the development of retrieval-augmented and hop-aware QA methodologies (Gupta et al., 20 May 2025).

1. Conceptual Foundation and Motivation

NovelHopQA addresses the intersection of two previously isolated axes in QA research: processing extended textual context and performing multi-hop sequential inference. Existing datasets such as HotpotQA and WikiHop focus on two-hop reasoning within short documents, while others like NarrativeQA or LV-Eval test long-form comprehension but restrict themselves to single-hop or summary-style questions. NovelHopQA was developed in response to the absence of benchmarks probing both extended context lengths (≥64k tokens) and increased reasoning depth (1–4 hops) within coherent narratives, using public-domain novels as the source texts (Gupta et al., 20 May 2025). The motivation is to diagnose accuracy degradation as both context window and number of reasoning steps grow, and to expose systematic failure modes in state-of-the-art models.

2. Dataset Construction and Instances

The dataset comprises 4,000 QA instances, stratified into approximately 1,000 examples per hop level, sampled from 64k-, 96k-, and 128k-token excerpts of 83 English public-domain novels (Project Gutenberg). The construction pipeline consists of anchor-keyword selection (min. 50 occurrences per anchor), paragraph pool creation (≥30 words per paragraph), and guided multi-hop chain building. For each example, a hop chain H = {h₁, h₂, ..., h_H} is constructed by selecting paragraphs via keyword-driven filtering and semantic clue expansion:

  • For the initial hop, the system selects a paragraph containing a designated anchor keyword and generates an answerable question.
  • For subsequent hops, context is expanded via semantically novel keywords and additional paragraphs, mandating answer integration across all selected evidence.

The following algorithmic sketch details the iterative chain construction:

for each novel in corpus:
  anchors ← select_anchors(novel)         # ≥5 high-frequency keywords (≥50 occurrences each)
  pool    ← split_into_paragraphs(novel)  # paragraphs of ≥30 words
  for target_hop_count H in {1..4}:
    C₀ ← ∅                                # accumulated evidence context
    for h in 1..H:
      if h == 1:
        k  ← anchors[1]                   # anchor keyword for the first hop
        pₕ ← f_select({k}, pool)          # paragraph containing the anchor
      else:
        kₕ ← f_extract_keyword(C_{h-1})   # semantically novel keyword from prior context
        pₕ ← f_select({k, kₕ}, pool)      # paragraph linking anchor and new keyword
      Cₕ ← C_{h-1} ∪ {pₕ}                 # expand the evidence set with the new paragraph
      (Qₕ, Aₕ) ← GenQA(Cₕ)                # generate a question requiring all selected evidence
      remove pₕ from pool                 # prevent paragraph reuse across hops

Subsequent oracle-filtering ensures that only questions answerable using the available evidence are retained, validated by human annotators for both alignment and required reasoning depth (Gupta et al., 20 May 2025).

3. Evaluation and Metrics

Seven state-of-the-art LLMs were assessed, including OpenAI o1, multiple GPT-4o variants, LLaMA-3.3-70B-Instruct, and Google Gemini 2.5 Pro/Flash models. Standard chain-of-thought prompting was employed. Two evaluation regimes were applied:

  • Full-context inference (golden context).
  • Retrieval-augmented generation (RAG): novels are chunked into 350-token segments; top-k (k=7) chunks retrieved by bge-large-en embeddings and FAISS, yielding ∼2.5k tokens concatenated for model input.
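
A minimal sketch of this retrieval setup is shown below, assuming the sentence-transformers and faiss libraries; the whitespace-based chunker, the placeholder variables novel_excerpt and question, and the exact embedding checkpoint name are illustrative stand-ins rather than the benchmark's released configuration.

# Hedged sketch of the RAG stage described above (350-token chunks, bge-large-en embeddings, FAISS, top-7).
# Assumes: pip install sentence-transformers faiss-cpu
import faiss
from sentence_transformers import SentenceTransformer

def chunk_by_tokens(text, chunk_size=350):
    # Approximate 350-token chunks via whitespace tokens; the benchmark's tokenizer may differ.
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

def build_index(chunks, model):
    # Embed all chunks and index them for inner-product (cosine, since normalized) search.
    embeddings = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index

def retrieve_context(question, chunks, index, model, k=7):
    # Return the top-k chunks (~2.5k tokens in total) concatenated as the reader's input context.
    query_emb = model.encode([question], normalize_embeddings=True)
    _, ids = index.search(query_emb, k)
    return "\n\n".join(chunks[i] for i in ids[0])

model = SentenceTransformer("BAAI/bge-large-en")   # bge-large-en embeddings, per the setup above
chunks = chunk_by_tokens(novel_excerpt)            # novel_excerpt: a 64k-128k-token narrative (placeholder)
index = build_index(chunks, model)
context = retrieve_context(question, chunks, index, model, k=7)  # question: a NovelHopQA query (placeholder)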

Metrics used:

  • Exact Match (EM): EM = 1[â = a*], where â is the predicted answer and a* the gold answer.
  • Token-level F1: F1 = 2 · precision · recall / (precision + recall), computed over the token overlap between prediction and gold answer.
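
For concreteness, the two metrics can be computed as in the following sketch (whitespace tokenization and lowercasing are simplifying assumptions; the benchmark's exact answer-normalization rules may differ):

from collections import Counter

def exact_match(prediction, gold):
    # EM = 1 if the normalized prediction equals the normalized gold answer, else 0.
    return int(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction, gold):
    # Token-level F1 over the multiset overlap between predicted and gold answer tokens.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)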

4. Empirical Results and Failure Modes

The relationship between hop depth and context length yields consistent performance degradation; the table below reports average EM across the evaluated models:

Context (tokens)   H=1     H=2     H=3     H=4
64k                86.4%   79.5%   77.1%   73.8%
96k                83.5%   76.5%   74.7%   71.9%
128k               81.7%   73.5%   72.0%   68.9%

Top-performing models (e.g., Gemini 2.5 Pro, o1) achieve EM >95% on golden context but fall to ∼60% in RAG, while mid-tier models show more pronounced drops. The average EM decreases by 12.6 points from H=1 to H=4 in 64k contexts; RAG-based inference yields additional drops of 25–35 points, underscoring the brittleness of current retrieval techniques.

Characteristic Failure Modes

Four dominant error classes were documented, each illustrated via concrete narrative examples:

  1. Missing Final-Hop Integration: Sequential clues are processed correctly up to H−1 but final-hop evidence is ignored.
  2. Entity Confusion/Coreference Errors: Model misattributes actions or facts to incorrect entities, usually in later hops.
  3. Incomplete Evidence Combination: Only a subset of required facts is reported, typically omitting the newest evidence.
  4. Contextual Drift: Model returns to irrelevant details from earlier hops rather than maintaining focus on the necessary chain.

5. Methodological Innovations and Adaptation

NovelHopQA motivates hop-aware retrieval and QA pipeline strategies, such as iterative expand-and-retrieve or multi-hop dense retrievers with posterior regularization. Momentum Posterior Regularization (MoPo) (Xia et al., 2024) is of particular relevance:

  • MoPo framework: Uses a query-focused summary at each hop rather than raw answers, and distills posterior retrieval knowledge into the prior via a momentum moving average. This stabilizes learning by keeping the posterior parameters φ close to the evolving prior θ during training.
  • Adaptation protocol: NovelHopQA instances are serialized into “PostSum” chains, enabling training of dual-encoder retrievers (E5/Contriever); inference then proceeds with iterative summary-guided retrieval, and the posterior retriever is discarded.
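
The momentum distillation step can be pictured as an exponential-moving-average parameter update; the sketch below is a generic rendering of that idea in PyTorch, and the coefficient, update direction, and surrounding training loop are simplifications rather than MoPo's actual implementation.

import torch

@torch.no_grad()
def momentum_update(prior_encoder, posterior_encoder, m=0.99):
    # Keep the posterior retriever's parameters (phi) close to the evolving prior (theta)
    # by nudging phi toward theta with an exponential moving average each training step.
    for theta, phi in zip(prior_encoder.parameters(), posterior_encoder.parameters()):
        phi.data.mul_(m).add_(theta.data, alpha=1.0 - m)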

MoPo yielded state-of-the-art retrieval and QA results on HotpotQA, with R@2 = 94.8% and EM@2 = 63.0%, outperforming fixed/dynamic two-stage posterior regularization and naive concatenation (Xia et al., 2024). In principle, these methods transfer directly to NovelHopQA via backward summary generation, plug-and-play encoder initialization, and single-stage momentum distillation.

6. Comparative Methodologies: Hop-Agnostic and Graph-Based Approaches

Unified frameworks such as Iterative Document Reranking (IDRQA) (Nie et al., 2020) demonstrate hop-agnostic, adaptive pipelines validated on HotpotQA and, in principle, extensible to NovelHopQA:

  • Retrieval and Question Expansion: Subsequent hops employ a Question Updater that refines the query by concatenating extracted clue spans, enabling multi-hop retrieval without pre-specifying the number of steps.
  • Graph-Based Reranking: Multi-document interaction is achieved via a graph of entity nodes, with co-occurrence adjacency and contextual features aggregated in a Graph Attention Network, facilitating bridge reasoning.
  • Filtering and Adaptive Stopping: Paragraph-level scores enable dynamic filtering, and termination is controlled by a no-answer probability prediction.
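
Putting these pieces together, the control flow can be sketched as an iterative retrieve-rerank-update loop with adaptive stopping; every function name below is a placeholder for a component described above, not IDRQA's actual API.

def idrqa_style_pipeline(question, corpus, max_hops=4, stop_threshold=0.5):
    # Hop-agnostic multi-hop QA: retrieve, rerank over a document graph, then answer or expand the query.
    query, evidence = question, []
    answer = None
    for _ in range(max_hops):
        candidates = retrieve(query, corpus)                     # hop-level paragraph retrieval
        ranked = graph_rerank(question, evidence + candidates)   # entity-graph attention reranking
        evidence = filter_paragraphs(ranked)                     # paragraph-level score filtering
        answer, no_answer_prob = read(question, evidence)        # reader with a no-answer head
        if no_answer_prob < stop_threshold:                      # adaptive stopping criterion
            break
        clue_span = extract_clue(question, evidence)             # clue span for query expansion
        query = question + " " + clue_span                       # Question Updater: concatenate clue
    return answer, evidence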

Performance as reported by Nie et al. (2020):

Dataset            EM      F1
SQuAD Open         56.6%   56.6%
HotpotQA (joint)   36.0%   63.9%

Ablation studies confirm that graph-based reranking and iterative retrieval are both necessary for strong multi-hop performance. Limitations include reduced robustness on numerically or structurally complex questions; moreover, support-fact prediction remains decoupled from answer-span selection, suggesting that further joint modeling is needed in NovelHopQA-like domains (Nie et al., 2020).

7. Consequences, Recommendations, and Open Problems

The empirical and methodological analyses of NovelHopQA lead to several key insights:

  • Scaling context windows alone does not mitigate multi-hop degradation; accuracy declines systematically as hop count increases, irrespective of absolute context size.
  • Standard RAG protocols cannot reliably recover multi-hop evidence or narrative order, especially above 64k tokens; more sophisticated hop-aware retrieval and iterative summary generation are recommended.
  • Persistent failure modes indicate gaps in both evidence combination across hops and coreference tracking, implying the need for chain-of-thought-aligned architectures.
  • Future research directions include scalable models that maintain coherent inference chains over ≥100k tokens, retrieval-conditioned text generation that guarantees all hops are covered, and benchmarks diversified to non-narrative technical domains.

NovelHopQA, with its open-source code, data, templates, and extensive validation, establishes a reproducible standard for diagnosing, benchmarking, and ultimately improving long-context multi-hop question answering in realistic narrative settings (Gupta et al., 20 May 2025).
