
FinReflectKG-MultiHop Benchmark

Updated 4 January 2026
  • The paper introduces FinReflectKG-MultiHop, a benchmark that leverages SEC 10-K filings to construct a temporally indexed knowledge graph for multi-hop financial QA.
  • It employs multi-pass and reflection-agent extraction methods to form composite multi-hop links, significantly boosting multi-hop path precision and recall during knowledge graph extraction.
  • Structured subgraph pattern mining and controlled retrieval protocols underpin the benchmark’s robust evaluation of financial AI reasoning.

FinReflectKG-MultiHop is a benchmark and methodology for evaluating multi-hop reasoning over financial disclosures, leveraging a temporally indexed and source-attributed knowledge graph constructed from SEC 10-K filings of S&P 100 companies (2022–2024). It addresses the intrinsic challenge of connecting fragmented evidence across documents, companies, and fiscal years by mining and curating analytically significant subgraph patterns that support higher-order question answering. Coupled with agentic extraction algorithms that enable direct multi-hop link formation during knowledge graph construction, FinReflectKG-MultiHop uniquely facilitates efficient and accurate KG-guided financial QA, providing a 555-example dataset and a comprehensive evaluation regime for retrieval and reasoning models (Arun et al., 25 Aug 2025, Arun et al., 3 Oct 2025).

1. Knowledge Graph Construction and Structure

FinReflectKG is defined by a source-attributed, temporally indexed directed graph G = (E, R, T, A), where E is the set of entities (e.g., organization, financial metric, risk factor), R is the set of typed relations, T indexes fiscal years, and A is the set of audited triples with provenance. Each triple a = (h, r, o, τ, c) has a head entity h ∈ E, relation r ∈ R, tail entity o ∈ E, fiscal year τ ∈ T, and a chunk ID c ∈ C referencing a 4K-character window from SEC filings. The construction pipeline parses documents, decomposes them into page-/chunk-level units, extracts triples via agentic LLM+rule pipelines, and audits each triple by linking it to its source chunk.

A Neo4j-style index enables fast traversal and provenance resolution, supporting downstream reasoning, retrieval, and pattern mining. This structure directly supports assembly of multi-hop paths by providing ground attributes for each edge and node.
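
As an illustration of the graph structure and traversal, a minimal in-memory sketch follows; the real system uses a Neo4j-style index, and the entity names, relations, and chunk IDs here are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str          # h ∈ E
    relation: str      # r ∈ R
    tail: str          # o ∈ E
    fiscal_year: int   # τ ∈ T
    chunk_id: str      # c ∈ C, provenance pointer into the source filing

# Illustrative audited triples (entities, relations, and chunk IDs are made up)
A = [
    Triple("AAPL", "DISCLOSES", "Carbon Neutral Initiative", 2023, "aapl-10k-2023:c041"),
    Triple("Carbon Neutral Initiative", "IMPACTS", "ROE", 2023, "aapl-10k-2023:c112"),
]

def neighbors(graph, head, fiscal_year=None):
    """Outgoing edges from `head`, optionally filtered by fiscal year."""
    return [t for t in graph
            if t.head == head and (fiscal_year is None or t.fiscal_year == fiscal_year)]

def two_hop_paths(graph, start):
    """Assemble 2-hop paths, carrying per-edge provenance for each hop."""
    paths = []
    for e1 in neighbors(graph, start):
        for e2 in neighbors(graph, e1.tail):
            paths.append(((e1, e2), [e1.chunk_id, e2.chunk_id]))
    return paths

for (e1, e2), chunks in two_hop_paths(A, "AAPL"):
    print(e1.head, e1.relation, e1.tail, "->", e2.relation, e2.tail, "| provenance:", chunks)
```

Because every edge carries its chunk ID, each assembled path resolves directly back to source text, which is what makes grounded multi-hop QA generation possible.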

2. Multi-Pass and Reflection-Agent Extraction Mechanisms

Two extraction modes shape FinReflectKG's approach to multi-hop edge formation within financial document KGs:

  • Multi-Pass Extraction: For each chunk c, an LLM first extracts candidate triples (T_c^{(1)}), then normalizes, merges, and enforces schema on them (T_c^{(2)}). Each pass operates on isolated chunks, growing the graph incrementally, but composite (multi-hop) edges can only emerge through subsequent graph traversal; extraction itself is not contextually multi-hop.
  • Reflection-Agent Extraction: Builds on multi-pass but employs a critic-corrector LLM loop. For each chunk, candidate triples T_c^{(1)} are extracted; the critic LLM generates feedback F_c^{(t)}, and the corrector applies fixes iteratively until convergence. Each agent iteration is supplied both the current chunk and its local subgraph, enabling direct proposal of composite edges. For instance, if (A, Produces, X) and (X, Impacts, B) exist, the agent will synthesize and add (A, Impacts, B), forming a valid 2-hop edge immediately during extraction.

Formal Definition: For a directed graph G = (V, E), a k-hop edge between nodes u and v exists if there is a path

u = v_0 \xrightarrow{r_1} v_1 \xrightarrow{r_2} \cdots \xrightarrow{r_k} v_k = v

and is documented as

E^{(k)} = \{ (u, R, v) \mid \exists\, v_1, \ldots, v_{k-1},\; R = (r_1, \ldots, r_k),\; (v_{i-1}, r_i, v_i) \in E \;\; \forall i \in [1..k] \}

where v_0 = u, v_k = v, and R is collapsed to transitive relation types as relevant.
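
The definition of E^{(k)} can be sketched as a direct enumeration, assuming triples are plain (head, relation, tail) tuples (a simplification that drops the fiscal-year and provenance fields):

```python
def k_hop_edges(E, k):
    """Enumerate E^(k): composite edges (u, (r_1, ..., r_k), v) joined by a k-step path.
    E is a set of (head, relation, tail) triples."""
    # Start with all 1-hop paths, then extend one hop at a time.
    paths = [((h,), (r,), t) for (h, r, t) in E]   # (visited heads, relation chain, current tail)
    for _ in range(k - 1):
        paths = [
            (nodes + (t1,), rels + (r2,), t2)
            for (nodes, rels, t1) in paths
            for (h2, r2, t2) in E
            if h2 == t1                            # next hop must start where the last ended
        ]
    return {(nodes[0], rels, t) for (nodes, rels, t) in paths}

E = {("A", "Produces", "X"), ("X", "Impacts", "B")}
print(k_hop_edges(E, 2))  # {('A', ('Produces', 'Impacts'), 'B')}
```

This mirrors the reflection agent's composite-edge example: the 2-hop path A → X → B collapses into a single edge carrying the relation chain (Produces, Impacts).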

3. Subgraph Pattern Mining and QA Generation

Mining analytically meaningful multi-hop patterns proceeds via the following:

  • Frequent Path Mining: Candidate patterns P of length k ∈ {2, 3} are generated by an expert-guided LLM (Qwen3-235B-A22B), scrutinized for support (|{p ∈ G : p matches P}|) using Cypher queries, and filtered according to a minimum support threshold θ_k. Patterns are retained based on analytical value and financial relevance.
  • QA Pipeline: From each P, path instances p = (h, r_1, e_1, ..., r_k, o_k) with supporting source chunks {c_0, ..., c_k} are used as input to LLM prompts that generate financial-analyst-style questions and corresponding answers. Each (Q, A) pair is scored on five criteria (style, multi-hop fidelity, groundedness, relevance, expertise), with pairs retained only if the aggregate score is ≥ 40/50. The dataset maintains approximate coverage ratios (52% 2-hop, 48% 3-hop; intra-doc 48.7%, inter-year 41.6%, cross-company 9.7%).

Example:

  • 2-hop: Apple’s ESG initiative “Carbon Neutral” disclosed (2023), linked to a positive ROE change, with answer and provenance.
  • 3-hop: Supplier chain across Tesla, Model S, Battery, Panasonic drawn from tables and narrative.
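
The support threshold and the 40/50 aggregate-score gate described above can be sketched as follows (function names and criterion keys are illustrative, not the actual pipeline's identifiers):

```python
def passes_support(pattern_matches, theta_k):
    """Retain a pattern P only if its support |{p in G : p matches P}| meets θ_k."""
    return len(pattern_matches) >= theta_k

def retain_qa_pair(scores):
    """Keep a (Q, A) pair only if the aggregate of the five criteria is >= 40/50.
    `scores` maps criterion name -> score in [0, 10]."""
    assert len(scores) == 5, "expected exactly five scoring criteria"
    return sum(scores.values()) >= 40

scores = {"style": 9, "multi_hop_fidelity": 8, "groundedness": 9,
          "relevance": 8, "expertise": 7}
print(retain_qa_pair(scores))  # True (aggregate 41/50)
```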

4. Controlled Retrieval Regimes and Evaluation Protocols

Each QA example's underlying graph path is paired with three evidence regimes for question answering:

  • S1: KG-Linked Minimal Evidence — the exact extracted chunks for each hop.
  • S2: Page Window — ±5 pages around each chunk, with no distractors.
  • S3: Window + Distractor — as S2, plus random irrelevant pages to introduce retrieval noise.
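
A sketch of how the three regimes might assemble evidence pages for one example, assuming page-indexed chunks; the window size matches S2's ±5 pages, while the distractor count and sampling scheme are arbitrary illustrations:

```python
import random

def evidence_pages(hop_pages, all_pages, regime, window=5, n_distractors=4, seed=0):
    """Assemble the evidence page set for one QA example under regime S1, S2, or S3.
    hop_pages: page index of the gold chunk for each hop."""
    if regime == "S1":                       # exact KG-linked chunks only
        return sorted(set(hop_pages))
    window_pages = {p for c in hop_pages
                    for p in range(c - window, c + window + 1) if p in all_pages}
    if regime == "S2":                       # ±5-page window, no distractors
        return sorted(window_pages)
    if regime == "S3":                       # window plus random irrelevant pages
        rng = random.Random(seed)
        candidates = [p for p in all_pages if p not in window_pages]
        picked = rng.sample(candidates, min(n_distractors, len(candidates)))
        return sorted(window_pages | set(picked))
    raise ValueError(f"unknown regime: {regime}")

all_pages = set(range(100))
print(evidence_pages([10, 42], all_pages, "S1"))  # [10, 42]
```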

Across four LLM architectures (Qwen3-8B/32B, GPT-OSS-20B/120B; Reasoning and Non-Reasoning prompts), models are evaluated for correctness (C_{m,s} ∈ [0, 10], assessed by Qwen3-235B as LLM-as-Judge), semantic similarity (BERTScore F1), and token utilization (T_{m,s}^{in/out}). Gains are calculated relative to S2:

  • Δ_correct = (C_{m,S1} − C_{m,S2}) / C_{m,S2} × 100%
  • Δ_tokens = (T_{m,S2}^{in} − T_{m,S1}^{in}) / T_{m,S2}^{in} × 100%
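
The two gain formulas can be checked directly against the reported per-model numbers; a minimal sketch (function names are illustrative):

```python
def delta_correct(c_s1, c_s2):
    """Relative correctness gain of KG-linked retrieval (S1) over page windows (S2)."""
    return (c_s1 - c_s2) / c_s2 * 100

def delta_tokens(t_in_s1, t_in_s2):
    """Relative reduction in input tokens when moving from S2 to S1."""
    return (t_in_s2 - t_in_s1) / t_in_s2 * 100

# GPT-OSS-120B figures from the evaluation
print(round(delta_correct(8.09, 7.12), 1))   # 13.6
print(round(delta_tokens(1967, 12414), 1))   # 84.2
```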

5. Quantitative Results for Multi-Hop QA

Empirical evaluation (Arun et al., 3 Oct 2025) demonstrates substantial improvement using KG-guided retrieval (S1) over naive page-window retrieval (S2):

| Model | C_{S1} | C_{S2} | Δ_correct | T^{in}_{S1} | T^{in}_{S2} | Δ_tokens |
|---|---|---|---|---|---|---|
| GPT-OSS-120B | 8.09 | 7.12 | 13.6% | 1,967 | 12,414 | 84.2% |
| GPT-OSS-20B | 7.75 | 6.46 | 20.0% | 1,967 | 12,451 | 84.2% |
| Qwen3-32B | 8.23 | 6.59 | 24.9% | 2,069 | 13,602 | 84.8% |
| Qwen3-8B | 8.03 | 5.77 | 39.2% | 2,069 | 13,601 | 84.8% |

Mean correctness gain across models is approximately 24%, with mean token reduction of 84.5%. Intra-document evidence yields the highest correctness (C ≈ 7.47), cross-company paths are intermediate (C ≈ 7.12), and inter-year paths the lowest (C ≈ 6.72), the latter reflecting increased semantic drift and alignment complexity.

6. Impact and Prospects for Multi-Hop Reasoning and Financial AI

FinReflectKG-MultiHop exemplifies rigorous benchmarking for multi-hop financial QA by exploiting knowledge graphs whose auditability, temporal indexing, and fine-grained provenance support efficient chaining of facts. Structured KG-guided retrieval boosts model correctness and decreases input cost (token usage), with relatively small models benefitting more significantly from precise evidence—underscoring the dominance of retrieval quality over pure model scale.

Suggested extensions include expansion to more cross-company and longer-range temporal inquiries, integrating closed-source LLMs as additional evaluators, broader expert audit coverage, and development of hybrid retrievers blending KG and semantic paradigms.

A curated 555 QA-pair benchmark with full provenance is released for community use at https://anonymous.4open.science/r/finreflectkg-multihopqa-BD45/, providing a domain-tailored standard for multi-hop QA, retrieval, and KG-driven reasoning in finance (Arun et al., 3 Oct 2025).

7. Reflection-Driven Extraction and Multi-Hop Discovery

The reflection-agent mode within FinReflectKG demonstrably enhances multi-hop edge formation, surfacing composite relationships during initial extraction. In comparative evaluation (Arun et al., 25 Aug 2025), multi-hop path-oriented metrics favor reflection agents:

| Metric | Multi-Pass | Reflection-Agent |
|---|---|---|
| 2-hop Path Precision | 58.2% | 68.7% |
| 2-hop Path Recall | 52.5% | 62.3% |
| 3-hop Path Precision | 47.3% | 59.3% |
| 3-hop Path Recall | 40.1% | 52.7% |

Reflection feedback enables the critic-corrector agent to synthesize missing intermediate hops and propose new composite edges using graph context, raising 2-hop and 3-hop precisions by 10.5 and 12.0 points, respectively. The formal iterative algorithm, concrete multi-hop SEC 10-K extraction examples, and metric improvements underscore FinReflectKG’s capability for agentic, schema-guided multi-hop link formation directly during extraction, enriching the KG and supporting downstream multi-hop QA.
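
The iterative critic-corrector algorithm can be sketched as follows, with toy stand-ins for the LLM calls (all function bodies here are illustrative assumptions, not the actual FinReflectKG prompts):

```python
def reflection_extract(chunk, subgraph, extractor, critic, corrector, max_iters=3):
    """Critic-corrector loop: iterate until the critic has no feedback or the budget is spent.
    extractor/critic/corrector stand in for LLM calls conditioned on the chunk and local subgraph."""
    triples = extractor(chunk)
    for _ in range(max_iters):
        feedback = critic(chunk, triples, subgraph)
        if not feedback:          # convergence: the critic accepts the triples
            break
        triples = corrector(triples, feedback, subgraph)
    return triples

# Toy stand-ins: the critic requests the missing composite 2-hop edge.
def extractor(chunk):
    return [("A", "Produces", "X"), ("X", "Impacts", "B")]

def critic(chunk, triples, subgraph):
    return [] if ("A", "Impacts", "B") in triples else ["add composite edge A -> B"]

def corrector(triples, feedback, subgraph):
    return triples + [("A", "Impacts", "B")]

result = reflection_extract("chunk text", None, extractor, critic, corrector)
print(("A", "Impacts", "B") in result)  # True
```

The key design point is that the corrector sees the local subgraph, so composite edges like (A, Impacts, B) can be proposed at extraction time rather than deferred to post-hoc traversal.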
