FinReflectKG MultiHop Benchmark
- FinReflectKG – MultiHop is a benchmark for multi-hop financial QA that leverages structured, temporally-indexed knowledge graphs to model complex financial evidence retrieval.
- It integrates a two-phase pipeline with pattern mining, prompt engineering, and human-in-the-loop quality control to produce high-fidelity financial QA pairs.
- Empirical results show significant improvements in accuracy and token efficiency over traditional neural IR methods, highlighting its practical impact on financial AI.
FinReflectKG – MultiHop is a benchmark and experimental platform designed to evaluate and advance multi-hop question answering (QA) over structured, temporally indexed financial knowledge graphs. Grounded in the FinReflectKG dataset—an open-source KG linking audited triples from S&P 100 SEC 10-K filings (2022–2024) to exact evidence chunks—FinReflectKG – MultiHop rigorously tests the role of precise knowledge graph–guided evidence retrieval for complex financial QA at the intersection of retrieval, reasoning, and explainability (Arun et al., 3 Oct 2025).
1. Problem Motivation and Scope
Financial disclosures exhibit an intrinsically multi-hop evidential structure: essential supporting facts are distributed across heterogeneous sections, distinct filings, different fiscal years, and multiple companies. LLM-based QA systems, when fed unstructured or loosely structured text, typically resort to broad retrieval strategies—such as windowed semantic search—that force reasoning models to operate over large, noisy context. This results in excessive token consumption (increasing cost and error) and reduced answer correctness, as key support may be missed or diluted. FinReflectKG – MultiHop targets this retrieval-constraint bottleneck: it quantifies and contrasts the downstream impacts of granular, KG-linked retrieval versus traditional neural IR approaches in demanding, multi-hop financial question answering.
The benchmark leverages temporal indexing to reflect the evolving nature of financial facts and regulatory disclosure standards, and emphasizes the construction of realistic, cross-document, cross-entity, and cross-year queries—thus modeling practical analyst workflows.
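The temporal indexing described above can be pictured with a minimal, hypothetical triple schema (field names and example values are illustrative; the actual FinReflectKG schema may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TemporalTriple:
    """One audited KG fact tied to its source filing fragment (illustrative schema)."""
    subject: str         # head entity, e.g. a company
    relation: str        # relation type, e.g. "discloses_risk"
    obj: str             # tail entity or literal value
    fiscal_year: int     # temporal index (FinReflectKG covers 2022-2024)
    evidence_chunk: str  # identifier of the exact 10-K evidence chunk

# A cross-company, cross-year query must compose facts like these:
facts = [
    TemporalTriple("CompanyA", "discloses_risk", "supply_chain", 2022, "chunk_017"),
    TemporalTriple("CompanyB", "discloses_risk", "supply_chain", 2023, "chunk_342"),
]
```

Because every triple carries both a fiscal year and an evidence-chunk identifier, any answer composed from such facts is automatically time-scoped and traceable to its source text.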
2. Benchmark Construction Methodology
FinReflectKG – MultiHop employs a two-phase pipeline for the synthesis of its QA pairs, each tied to explicit multi-hop reasoning over graph-structured evidence.
- Pattern Mining and Prompt Engineering: Frequent 2-hop and 3-hop subgraph patterns are identified across FinReflectKG using the GICS sector taxonomy to maximize coverage and diversity. Pattern-specific prompts, referencing exact entity and relation types, are employed to generate natural-language questions in the idiom of financial analysts, explicitly encoding temporal constraints and multi-hop compositionality.
- Quality Control Validation: QA pairs undergo a multi-criteria human-in-the-loop review, with each instance scored across analyst tone, multi-hop fidelity, evidence grounding, and domain relevance. QA samples not meeting strict thresholds (e.g., a minimum score of 40/50) are excluded, ensuring high quality and faithfulness. All answer sets are directly justified by chains of audited, time-stamped KG triples and their source filing fragments.
Precise linking of each question and answer to supporting KG evidence enables controlled evaluation of retrieval and reasoning stages independently.
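The quality gate described above can be sketched as a simple threshold filter. The first four review axes are named in the text; the fifth axis and the 10-point-per-axis scale are assumptions made here so that five scores total 50:

```python
# Review axes: the first four are named in the benchmark description;
# "overall_clarity" is an assumed fifth axis so five 10-point scores total 50.
CRITERIA = ["analyst_tone", "multihop_fidelity", "evidence_grounding",
            "domain_relevance", "overall_clarity"]

def passes_quality_gate(scores: dict[str, int], threshold: int = 40) -> bool:
    """Keep a QA pair only if its summed human-review score meets the 40/50 bar."""
    assert set(scores) == set(CRITERIA), "one score per review axis"
    return sum(scores.values()) >= threshold

# A strong sample (9/10 on every axis, total 45) passes;
# a mediocre one (7/10 on every axis, total 35) is excluded.
```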
3. Controlled Retrieval Scenarios
Three evidence retrieval protocols are defined to isolate the contribution of KG-linked reasoning:
| Scenario | Evidence Scope | Avg. Input Tokens | Retrieval Mode |
|---|---|---|---|
| S1 | KG-linked minimal: only the relevant fact and context chunks | ~2,069 | Knowledge graph |
| S2 | ±5-page windows around the ground-truth evidence | ~13,602 | Semantic/window IR |
| S3 | S2 windows plus random distractor segments for added noise | higher than S2 | Distractor-augmented IR |
Models receive, respectively, (S1) only the exact evidence as navigated through the KG, (S2) broader windowed segments as in conventional text IR, and (S3) further noise-augmented contexts simulating real-world unstructured retrieval challenges. This design enables quantification of the gains attributable to fine-grained graph-based retrieval versus the limitations of windowed neural IR.
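The three protocols can be sketched as follows (a minimal illustration; the function and parameter names are invented, not taken from the benchmark's code):

```python
import random

def build_context(scenario: str, kg_chunks: list[str], doc_pages: list[str],
                  gold_page: int, corpus: list[str], window: int = 5,
                  n_distractors: int = 3, seed: int = 0) -> list[str]:
    """Assemble the evidence context handed to the model under S1, S2, or S3."""
    if scenario == "S1":
        # S1: only the exact KG-linked fact and context chunks.
        return list(kg_chunks)
    lo = max(0, gold_page - window)
    hi = min(len(doc_pages), gold_page + window + 1)
    context = doc_pages[lo:hi]  # S2: +/-5-page window around the gold evidence
    if scenario == "S3":
        # S3: add random distractor segments to simulate noisy retrieval.
        context += random.Random(seed).sample(corpus, k=n_distractors)
    return context
```

Holding the question fixed while varying only `scenario` isolates retrieval quality from reasoning quality, which is exactly the contrast the benchmark is built to measure.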
4. Evaluation Results
Empirical findings, as reported in the benchmark's evaluation tables, indicate:
- Correctness (LLM-Judge Score):
For the Qwen3-32B model, correctness in the KG-linked condition (S1) is 8.23, versus 6.59 in the page-window condition (S2).
- Token Efficiency:
The KG-linked configuration reduces average input tokens from 13,602 (S2) to 2,069 (S1), a reduction of approximately 84.8%.
- Relative Gain Formulas:
Correctness gain = (8.23 − 6.59) / 6.59 ≈ +24.9%; token reduction = (13,602 − 2,069) / 13,602 ≈ 84.8%.
- Robustness Across Tasks:
These improvements hold for both LLM-based reasoning models and simpler retrieval baselines, indicating that the dominant source of error lies in evidence selection rather than in logical composition alone.
- Scope:
The benchmark includes intra-document, inter-year, and cross-company queries, validating the generality of the multi-hop paradigm.
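The headline numbers above can be reproduced with two one-line formulas (a sketch; the benchmark's own evaluation scripts may compute these differently):

```python
def relative_gain(kg_score: float, ir_score: float) -> float:
    """Relative correctness gain of KG-linked retrieval (S1) over windowed IR (S2)."""
    return (kg_score - ir_score) / ir_score

def token_reduction(kg_tokens: int, ir_tokens: int) -> float:
    """Fraction of average input tokens saved by KG-linked retrieval."""
    return (ir_tokens - kg_tokens) / ir_tokens

print(f"correctness gain: {relative_gain(8.23, 6.59):.1%}")    # prints "correctness gain: 24.9%"
print(f"token reduction:  {token_reduction(2069, 13602):.1%}")  # prints "token reduction:  84.8%"
```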
5. Impact and Broader Implications
The demonstrated superiority of KG-guided evidence retrieval emphasizes the critical role of structured knowledge graphs in multi-hop financial QA. The benchmark confirms that retrieval precision—enabled by time-aware, audited KGs—removes extraneous noise and enables models to focus on composition, not search. This has substantial implications for the design of high-stakes, explainable financial AI systems:
- Explainability is enhanced through explicit provenance: every answer is traceable to specific, time-indexed auditable disclosures.
- Scalability is improved by reducing token bottlenecks, critical as LLM context windows, though expanding, remain a hard constraint.
- Benchmarking is more diagnostic, disentangling retrieval failures from model inference limitations.
A curated subset comprising 555 QA pairs is made available to support further fine-grained research.
6. Comparison to Related Work and Methodologies
FinReflectKG – MultiHop crystallizes lessons from recent work on KGQA and multi-hop reasoning over complex KGs (Yin et al., 2018, Cohen et al., 2019, Lv et al., 2019, Jin et al., 2021, Jiang et al., 2022, Choi et al., 2023, Chakraborty, 30 Apr 2024). While previous general-domain multi-hop QA benchmarks (e.g., MetaQA, ComplexWebQuestions) test large KGs or open-domain corpora, this benchmark is unique in its temporal indexing, auditable evidence chains, and financial domain schema. The empirical findings underscore that typical neural IR (semantic window approaches) is insufficient for the information density and dispersion found in regulatory financial documents.
Furthermore, by providing multiple controlled retrieval scenarios and exact supporting evidence, the benchmark enables precise isolation of retrieval versus reasoning capabilities—a feature not common in prior work.
7. Future Directions
Planned advancements include:
- Expanding Dataset Coverage to include more complex cross-sector, cross-instrument, and multi-year scenarios.
- Augmenting Automated Evaluation: Integrating additional automated judges and reducing evaluator bias to further validate the deterministic judging protocol as established in FinReflectKG – EvalBench (Dimino et al., 7 Oct 2025).
- Integration with Agentic Extraction and Reflection: Building on the agentic construction framework (Arun et al., 25 Aug 2025), future extensions may support iterative correction and reflection in multi-hop evidence retrieval and model reasoning.
Future research may also address compositional generalization, robustness to evolving reporting standards, and coupling FinReflectKG – MultiHop with real-time data feeds for event-driven question answering.
In summary, FinReflectKG – MultiHop constitutes a domain-specific, high-fidelity benchmark for evaluating and driving advances in evidence-linked multi-hop financial question answering. By quantifying the decisive advantages of KG-based retrieval in accuracy and efficiency, it clarifies the retrieval bottleneck and establishes new standards for explainability and robustness in financial AI (Arun et al., 3 Oct 2025).