FinReflectKG-MultiHop Benchmark
- The paper introduces FinReflectKG-MultiHop, a benchmark that leverages SEC 10-K filings to construct a temporally indexed knowledge graph for multi-hop financial QA.
- It employs multi-pass and reflection-agent extraction methods to form composite multi-hop links, significantly boosting multi-hop path precision and recall during knowledge graph construction.
- Structured subgraph pattern mining and controlled retrieval protocols underpin the benchmark’s robust evaluation of financial AI reasoning.
FinReflectKG-MultiHop is a benchmark and methodology for evaluating multi-hop reasoning over financial disclosures, leveraging a temporally indexed and source-attributed knowledge graph constructed from SEC 10-K filings of S&P 100 companies (2022–2024). It addresses the intrinsic challenge of connecting fragmented evidence across documents, companies, and fiscal years by mining and curating analytically significant subgraph patterns that support higher-order question answering. Coupled with agentic extraction algorithms that enable direct multi-hop link formation during knowledge graph construction, FinReflectKG-MultiHop uniquely facilitates efficient and accurate KG-guided financial QA, providing a 555-example dataset and a comprehensive evaluation regime for retrieval and reasoning models (Arun et al., 25 Aug 2025, Arun et al., 3 Oct 2025).
1. Knowledge Graph Construction and Structure
FinReflectKG is defined as a source-attributed, temporally indexed directed graph $G = (V, R, Y, T)$, where $V$ is the set of entities (e.g., organization, financial metric, risk factor), $R$ is the set of typed relations, $Y$ indexes fiscal years, and $T \subseteq V \times R \times V \times Y$ is the set of audited triples with provenance. Each triple has a head entity $h \in V$, relation $r \in R$, tail entity $t \in V$, fiscal year $y \in Y$, and a chunk ID referencing a 4K-character window from SEC filings. The construction pipeline parses documents, decomposes them into page-/chunk-level units, extracts triples via agentic LLM+rule pipelines, and audits each triple by linking it to its source chunk.
A Neo4j-style index enables fast traversal and provenance resolution, supporting downstream reasoning, retrieval, and pattern mining. This structure directly supports assembly of multi-hop paths by providing ground attributes for each edge and node.
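To make the structure concrete, the following is a minimal Python sketch of an audited triple and a provenance lookup against a Neo4j-style index; the field names, node labels, and query shape are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass
from neo4j import GraphDatabase  # standard neo4j Python driver

@dataclass(frozen=True)
class AuditedTriple:
    head: str          # entity, e.g. an organization
    relation: str      # typed relation
    tail: str          # entity, e.g. a financial metric
    fiscal_year: int   # temporal index
    chunk_id: str      # provenance: ~4K-character window in the filing

def resolve_provenance(driver, head: str, relation: str, tail: str):
    """Fetch fiscal year and source chunk for a given edge (hypothetical labels)."""
    query = (
        "MATCH (h:Entity {name: $head})-[r]->(t:Entity {name: $tail}) "
        "WHERE type(r) = $relation "
        "RETURN r.fiscal_year AS year, r.chunk_id AS chunk_id"
    )
    with driver.session() as session:
        return [rec.data() for rec in session.run(
            query, head=head, relation=relation, tail=tail)]
```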
2. Multi-Pass and Reflection-Agent Extraction Mechanisms
Two extraction modes shape FinReflectKG's approach to multi-hop edge formation within financial document KGs:
- Multi-Pass Extraction: For each chunk $c$, an LLM first extracts candidate triples (pass 1), then normalizes, merges, and enforces schema on them (pass 2). Each pass operates on isolated chunks, growing the graph incrementally, but composite (multi-hop) edges can only emerge through subsequent graph traversal; extraction itself is not contextually multi-hop.
- Reflection-Agent Extraction: Builds on multi-pass but employs a critic-corrector LLM loop. For each chunk, candidate triples are extracted; the critic LLM generates feedback, and the corrector applies fixes iteratively until convergence. Each agent iteration is supplied both the current chunk and its local subgraph, enabling direct proposal of composite edges. For instance, if $(a, r_1, b)$ and $(b, r_2, c)$ exist, the agent will synthesize and add $(a, r_1 \circ r_2, c)$, forming a valid 2-hop edge immediately during extraction. A sketch of this loop follows.
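A minimal sketch of the critic-corrector loop, assuming an abstract LLM interface with extract/critique/correct operations; the method names and the convergence test are illustrative, not the paper's exact algorithm:

```python
def reflect_extract(chunk: str, local_subgraph, llm, max_iters: int = 5):
    """Critic-corrector extraction over one chunk plus its local subgraph."""
    triples = llm.extract_triples(chunk)          # initial candidates
    for _ in range(max_iters):
        # The critic sees both the chunk and the local subgraph, so its
        # feedback can request composite (multi-hop) edges directly.
        feedback = llm.critique(triples, chunk, local_subgraph)
        if not feedback:                          # no remaining issues: converged
            break
        triples = llm.correct(triples, feedback, chunk, local_subgraph)
    return triples
```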
Formal Definition: For a directed graph $G = (V, E)$, a $k$-hop edge between nodes $v_0, v_k \in V$ exists if there is a path
$v_0 \xrightarrow{r_1} v_1 \xrightarrow{r_2} \cdots \xrightarrow{r_k} v_k$
and is documented as the composite triple $(v_0, r_1 \circ r_2 \circ \cdots \circ r_k, v_k)$, where the composed relation $r_1 \circ \cdots \circ r_k$ is collapsed to transitive relation types as relevant.
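In code, collapsing a path into a composite edge amounts to joining the relations along the path; a minimal sketch, with the relation-naming convention assumed:

```python
def compose_khop(path):
    """Collapse [(v0, r1, v1), (v1, r2, v2), ...] into one composite triple."""
    head, tail = path[0][0], path[-1][2]
    composite_rel = " ∘ ".join(r for _, r, _ in path)   # r1 ∘ r2 ∘ ... ∘ rk
    return (head, composite_rel, tail)

# compose_khop([("A", "supplies", "B"), ("B", "owns", "C")])
# -> ("A", "supplies ∘ owns", "C")
```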
3. Subgraph Pattern Mining and QA Generation
Mining analytically meaningful multi-hop patterns proceeds via the following:
- Frequent Path Mining: Candidate patterns $P$ of length $k \in \{2, 3\}$ are generated by an expert-guided LLM (Qwen3-235B-A22B), scrutinized for support $\mathrm{supp}(P)$ using Cypher queries, and filtered according to a minimum support threshold $\sigma_{\min}$. Patterns are retained based on analytical value and financial relevance (a support-query sketch follows the examples below).
- QA Pipeline: From each retained pattern, path instances with supporting source chunks are used as input to LLM prompts that generate financial analyst-style questions and corresponding answers. Each (Q, A) pair is scored on five criteria (style, multi-hop fidelity, groundedness, relevance, expertise), with pairs retained only if the aggregate score clears a minimum threshold. The dataset maintains approximate coverage ratios (52% 2-hop, 48% 3-hop; intra-doc 48.7%, inter-year 41.6%, cross-company 9.7%).
Example:
- 2-hop: Apple’s ESG initiative “Carbon Neutral” disclosed (2023), linked to a positive ROE change, with answer and provenance.
- 3-hop: Supplier chain across Tesla, Model S, Battery, Panasonic drawn from tables and narrative.
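As referenced above, support counting can be expressed as a Cypher aggregate; the pattern, node labels, and threshold below are hypothetical placeholders for the paper's mined patterns and $\sigma_{\min}$:

```python
from neo4j import GraphDatabase

# Hypothetical 2-hop pattern: company -> ESG initiative -> financial metric.
SUPPORT_2HOP = """
MATCH (:Company)-[:DISCLOSES]->(:ESGInitiative)-[:AFFECTS]->(:FinancialMetric)
RETURN count(*) AS support
"""

def pattern_support(driver, query: str) -> int:
    with driver.session() as session:
        return session.run(query).single()["support"]

def retain_pattern(support: int, sigma_min: int) -> bool:
    return support >= sigma_min   # minimum-support filter
```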
4. Controlled Retrieval Regimes and Evaluation Protocols
Each QA-tasked graph path is paired with three evidence regimes for question answering:
- S1: KG-linked Minimal Evidence — exact extracted chunks for each hop.
- S2: Page Window — 5 pages around each chunk, with no distractors.
- S3: Window+Distractor — as S2 with random irrelevant pages to introduce retrieval noise.
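An illustrative assembly of the three regimes for a single QA path; the data layout (chunk-to-page mapping, symmetric page window, distractor count) is an assumption made for this sketch:

```python
import random

def build_evidence(path_chunks, num_pages, page_of, window=5, n_distractors=3):
    """path_chunks: chunk IDs for each hop; page_of: chunk ID -> page index."""
    s1 = list(path_chunks)                              # S1: exact KG-linked chunks
    s2 = sorted({p for c in path_chunks                 # S2: page window per chunk
                 for p in range(max(0, page_of[c] - window),
                                min(num_pages, page_of[c] + window + 1))})
    irrelevant = [p for p in range(num_pages) if p not in set(s2)]
    s3 = s2 + random.sample(irrelevant, min(n_distractors, len(irrelevant)))
    return s1, s2, s3                                   # S3: S2 plus retrieval noise
```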
Across four LLM architectures (Qwen3-8B/32B, GPT-OSS-20B/120B; Reasoning and Non-Reasoning prompts), models are evaluated for correctness ($C$, assessed by Qwen3-235B as LLM-as-Judge), semantic similarity (BERTScore F1), and input-token utilization ($T_{\text{in}}$). Gains are calculated relative to S2: $\Delta_{\text{correct}} = (C_{S1} - C_{S2})/C_{S2}$ and $\Delta_{\text{tokens}} = (T_{\text{in},S2} - T_{\text{in},S1})/T_{\text{in},S2}$.
5. Quantitative Results for Multi-Hop QA
Empirical evaluation (Arun et al., 3 Oct 2025) demonstrates substantial improvement using KG-guided retrieval (S1) over naive page-window retrieval (S2):
| Model | C (S1) | C (S2) | Δ_correct | T_in (S1) | T_in (S2) | Δ_tokens |
|---|---|---|---|---|---|---|
| GPT-OSS-120B | 8.09 | 7.12 | 13.6% | 1,967 | 12,414 | 84.2% |
| GPT-OSS-20B | 7.75 | 6.46 | 20.0% | 1,967 | 12,451 | 84.2% |
| Qwen3-32B | 8.23 | 6.59 | 24.9% | 2,069 | 13,602 | 84.8% |
| Qwen3-8B | 8.03 | 5.77 | 39.2% | 2,069 | 13,601 | 84.8% |
Mean correctness gain across models is approximately 24%, with mean token reduction at 84.5%. Intra-document evidence yields the highest correctness, cross-company paths are intermediate, and inter-year paths the lowest, the latter reflecting increased semantic drift and alignment complexity.
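As a sanity check, the gain formulas from Section 4 reproduce the Qwen3-8B row of the table above:

```python
c_s1, c_s2 = 8.03, 5.77          # correctness under S1 and S2
t_s1, t_s2 = 2_069, 13_601       # input tokens under S1 and S2

delta_correct = (c_s1 - c_s2) / c_s2       # ≈ 0.392
delta_tokens = (t_s2 - t_s1) / t_s2        # ≈ 0.848

print(f"{delta_correct:.1%}, {delta_tokens:.1%}")   # 39.2%, 84.8%
```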
6. Impact and Prospects for Multi-Hop Reasoning and Financial AI
FinReflectKG-MultiHop exemplifies rigorous benchmarking for multi-hop financial QA by exploiting knowledge graphs whose auditability, temporal indexing, and fine-grained provenance support efficient chaining of facts. Structured KG-guided retrieval boosts model correctness and decreases input cost (token usage), with relatively small models benefitting more significantly from precise evidence—underscoring the dominance of retrieval quality over pure model scale.
Suggested extensions include expansion to more cross-company and longer-range temporal inquiries, integrating closed-source LLMs as additional evaluators, broader expert audit coverage, and development of hybrid retrievers blending KG and semantic paradigms.
A curated 555 QA-pair benchmark with full provenance is released for community use at https://anonymous.4open.science/r/finreflectkg-multihopqa-BD45/, providing a domain-tailored standard for multi-hop QA, retrieval, and KG-driven reasoning in finance (Arun et al., 3 Oct 2025).
7. Reflection-Driven Extraction and Multi-Hop Discovery
The reflection-agent mode within FinReflectKG demonstrably enhances multi-hop edge formation, surfacing composite relationships during initial extraction. In comparative evaluation (Arun et al., 25 Aug 2025), multi-hop path-oriented metrics favor reflection agents:
| Metric | Multi-Pass | Reflection-Agent |
|---|---|---|
| 2-hop Path Precision | 58.2% | 68.7% |
| 2-hop Path Recall | 52.5% | 62.3% |
| 3-hop Path Precision | 47.3% | 59.3% |
| 3-hop Path Recall | 40.1% | 52.7% |
Reflection feedback enables the critic-corrector agent to synthesize missing intermediate hops and propose new composite edges using graph context, raising 2-hop and 3-hop precisions by 10.5 and 12.0 points, respectively. The formal iterative algorithm, concrete multi-hop SEC 10-K extraction examples, and metric improvements underscore FinReflectKG’s capability for agentic, schema-guided multi-hop link formation directly during extraction, enriching the KG and supporting downstream multi-hop QA.
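The path-level metrics in the table reduce to set overlap between extracted and gold composite paths; a minimal sketch, assuming exact-match path comparison (the paper's matching criterion may be softer):

```python
def path_precision_recall(extracted: set, gold: set):
    """Each element is a composite path, e.g. (head, r1, mid, r2, tail)."""
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```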