FinReflectKG-MultiHop Benchmark
- The paper introduces FinReflectKG-MultiHop, a benchmark that leverages SEC 10-K filings to construct a temporally indexed knowledge graph for multi-hop financial QA.
- It employs multi-pass and reflection-agent extraction methods to form composite multi-hop links, significantly boosting multi-hop path precision and recall during knowledge graph construction.
- Structured subgraph pattern mining and controlled retrieval protocols underpin the benchmark’s robust evaluation of financial AI reasoning.
FinReflectKG-MultiHop is a benchmark and methodology for evaluating multi-hop reasoning over financial disclosures, leveraging a temporally indexed and source-attributed knowledge graph constructed from SEC 10-K filings of S&P 100 companies (2022–2024). It addresses the intrinsic challenge of connecting fragmented evidence across documents, companies, and fiscal years by mining and curating analytically significant subgraph patterns that support higher-order question answering. Coupled with agentic extraction algorithms that enable direct multi-hop link formation during knowledge graph construction, FinReflectKG-MultiHop uniquely facilitates efficient and accurate KG-guided financial QA, providing a 555-example dataset and a comprehensive evaluation regime for retrieval and reasoning models (Arun et al., 25 Aug 2025, Arun et al., 3 Oct 2025).
1. Knowledge Graph Construction and Structure
FinReflectKG is defined as a source-attributed, temporally indexed directed graph $G = (V, R, Y, T)$, where $V$ is the set of entities (e.g., organization, financial metric, risk factor), $R$ is the set of typed relations, $Y$ indexes fiscal years, and $T \subseteq V \times R \times V \times Y$ is the set of audited triples with provenance. Each triple has a head entity $h \in V$, relation $r \in R$, tail entity $t \in V$, fiscal year $y \in Y$, and a chunk ID referencing a 4K-character window from SEC filings. The construction pipeline parses documents, decomposes them into page-/chunk-level units, extracts triples via agentic LLM+rule pipelines, and audits each triple by linking it to its source chunk.
A Neo4j-style index enables fast traversal and provenance resolution, supporting downstream reasoning, retrieval, and pattern mining. This structure directly supports assembly of multi-hop paths by providing ground attributes for each edge and node.
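To make the structure concrete, the following is a minimal Python sketch of an audited triple and a provenance lookup against a Neo4j-style index; the field names, node labels, and query shape are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass
from neo4j import GraphDatabase  # standard neo4j Python driver

@dataclass(frozen=True)
class AuditedTriple:
    head: str          # entity, e.g. an organization
    relation: str      # typed relation
    tail: str          # entity, e.g. a financial metric
    fiscal_year: int   # temporal index
    chunk_id: str      # provenance: ~4K-character window in the filing

def resolve_provenance(driver, head: str, relation: str, tail: str):
    """Fetch fiscal year and source chunk for a given edge (hypothetical labels)."""
    query = (
        "MATCH (h:Entity {name: $head})-[r]->(t:Entity {name: $tail}) "
        "WHERE type(r) = $relation "
        "RETURN r.fiscal_year AS year, r.chunk_id AS chunk_id"
    )
    with driver.session() as session:
        return [rec.data() for rec in session.run(
            query, head=head, relation=relation, tail=tail)]
```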
2. Multi-Pass and Reflection-Agent Extraction Mechanisms
Two extraction modes shape FinReflectKG's approach to multi-hop edge formation within financial document KGs:
- Multi-Pass Extraction: For each chunk $c$, an LLM first extracts candidate triples (pass 1), then normalizes, merges, and enforces schema on them (pass 2). Each pass operates on isolated chunks, growing the graph incrementally, but composite (multi-hop) edges can only emerge through subsequent graph traversal; extraction itself is not contextually multi-hop.
- Reflection-Agent Extraction: Builds on multi-pass but employs a critic-corrector LLM loop. For each chunk, candidate triples are extracted; the critic LLM generates feedback, and the corrector applies fixes iteratively until convergence. Each agent iteration is supplied both the current chunk and its local subgraph, enabling direct proposal of composite edges. For instance, if $(a, r_1, b)$ and $(b, r_2, c)$ exist, the agent will synthesize and add $(a, r_1 \circ r_2, c)$, forming a valid 2-hop edge immediately during extraction. A sketch of this loop follows.
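A minimal sketch of the critic-corrector loop, assuming an abstract LLM interface with extract/critique/correct operations; the method names and the convergence test are illustrative, not the paper's exact algorithm:

```python
def reflect_extract(chunk: str, local_subgraph, llm, max_iters: int = 5):
    """Critic-corrector extraction over one chunk plus its local subgraph."""
    triples = llm.extract_triples(chunk)          # initial candidates
    for _ in range(max_iters):
        # The critic sees both the chunk and the local subgraph, so its
        # feedback can request composite (multi-hop) edges directly.
        feedback = llm.critique(triples, chunk, local_subgraph)
        if not feedback:                          # no remaining issues: converged
            break
        triples = llm.correct(triples, feedback, chunk, local_subgraph)
    return triples
```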
Formal Definition: For a directed graph $G = (V, E)$, a $k$-hop edge between nodes $v_0, v_k \in V$ exists if there is a path
$v_0 \xrightarrow{r_1} v_1 \xrightarrow{r_2} \cdots \xrightarrow{r_k} v_k$
and is documented as the composite triple $(v_0, r_1 \circ r_2 \circ \cdots \circ r_k, v_k)$, where the composed relation $r_1 \circ \cdots \circ r_k$ is collapsed to transitive relation types as relevant.
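In code, collapsing a path into a composite edge amounts to joining the relations along the path; a minimal sketch, with the relation-naming convention assumed:

```python
def compose_khop(path):
    """Collapse [(v0, r1, v1), (v1, r2, v2), ...] into one composite triple."""
    head, tail = path[0][0], path[-1][2]
    composite_rel = " ∘ ".join(r for _, r, _ in path)   # r1 ∘ r2 ∘ ... ∘ rk
    return (head, composite_rel, tail)

# compose_khop([("A", "supplies", "B"), ("B", "owns", "C")])
# -> ("A", "supplies ∘ owns", "C")
```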
3. Subgraph Pattern Mining and QA Generation
Mining analytically meaningful multi-hop patterns proceeds via the following:
- Frequent Path Mining: Candidate patterns $P$ of length $k \in \{2, 3\}$ are generated by an expert-guided LLM (Qwen3-235B-A22B), scrutinized for support $\mathrm{supp}(P)$ using Cypher queries, and filtered according to a minimum support threshold $\sigma_{\min}$. Patterns are retained based on analytical value and financial relevance (a support-query sketch follows the examples below).
- QA Pipeline: From each retained pattern, path instances with supporting source chunks are used as input to LLM prompts that generate financial analyst-style questions and corresponding answers. Each (Q, A) pair is scored on five criteria (style, multi-hop fidelity, groundedness, relevance, expertise), with pairs retained only if the aggregate score clears a minimum threshold. The dataset maintains approximate coverage ratios (52% 2-hop, 48% 3-hop; intra-doc 48.7%, inter-year 41.6%, cross-company 9.7%).
Example:
- 2-hop: Apple’s ESG initiative “Carbon Neutral” disclosed (2023), linked to a positive ROE change, with answer and provenance.
- 3-hop: Supplier chain across Tesla, Model S, Battery, Panasonic drawn from tables and narrative.
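As referenced above, support counting can be expressed as a Cypher aggregate; the pattern, node labels, and threshold below are hypothetical placeholders for the paper's mined patterns and $\sigma_{\min}$:

```python
from neo4j import GraphDatabase

# Hypothetical 2-hop pattern: company -> ESG initiative -> financial metric.
SUPPORT_2HOP = """
MATCH (:Company)-[:DISCLOSES]->(:ESGInitiative)-[:AFFECTS]->(:FinancialMetric)
RETURN count(*) AS support
"""

def pattern_support(driver, query: str) -> int:
    with driver.session() as session:
        return session.run(query).single()["support"]

def retain_pattern(support: int, sigma_min: int) -> bool:
    return support >= sigma_min   # minimum-support filter
```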
4. Controlled Retrieval Regimes and Evaluation Protocols
Each QA-tasked graph path is paired with three evidence regimes for question answering:
- S1: KG-linked Minimal Evidence — exact extracted chunks for each hop.
- S2: Page Window — 5 pages around each chunk, with no distractors.
- S3: Window+Distractor — as S2 with random irrelevant pages to introduce retrieval noise.
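An illustrative assembly of the three regimes for a single QA path; the data layout (chunk-to-page mapping, symmetric page window, distractor count) is an assumption made for this sketch:

```python
import random

def build_evidence(path_chunks, num_pages, page_of, window=5, n_distractors=3):
    """path_chunks: chunk IDs for each hop; page_of: chunk ID -> page index."""
    s1 = list(path_chunks)                              # S1: exact KG-linked chunks
    s2 = sorted({p for c in path_chunks                 # S2: page window per chunk
                 for p in range(max(0, page_of[c] - window),
                                min(num_pages, page_of[c] + window + 1))})
    irrelevant = [p for p in range(num_pages) if p not in set(s2)]
    s3 = s2 + random.sample(irrelevant, min(n_distractors, len(irrelevant)))
    return s1, s2, s3                                   # S3: S2 plus retrieval noise
```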
Across four LLM architectures (Qwen3-8B/32B, GPT-OSS-20B/120B; Reasoning and Non-Reasoning prompts), models are evaluated for correctness ($C$, assessed by Qwen3-235B as LLM-as-Judge), semantic similarity (BERTScore F1), and input-token utilization ($T_{\text{in}}$). Gains are calculated relative to S2: $\Delta_{\text{correct}} = (C_{S1} - C_{S2})/C_{S2}$ and $\Delta_{\text{tokens}} = (T_{\text{in},S2} - T_{\text{in},S1})/T_{\text{in},S2}$.
5. Quantitative Results for Multi-Hop QA
Empirical evaluation (Arun et al., 3 Oct 2025) demonstrates substantial improvement using KG-guided retrieval (S1) over naive page-window retrieval (S2):
| Model | C (S1) | C (S2) | Δ_correct | T_in (S1) | T_in (S2) | Δ_tokens |
|---|---|---|---|---|---|---|
| GPT-OSS-120B | 8.09 | 7.12 | 13.6% | 1,967 | 12,414 | 84.2% |
| GPT-OSS-20B | 7.75 | 6.46 | 20.0% | 1,967 | 12,451 | 84.2% |
| Qwen3-32B | 8.23 | 6.59 | 24.9% | 2,069 | 13,602 | 84.8% |
| Qwen3-8B | 8.03 | 5.77 | 39.2% | 2,069 | 13,601 | 84.8% |
Mean correctness gain across models is approximately 24%, with mean token reduction at 84.5%. Intra-document evidence yields the highest correctness, cross-company paths are intermediate, and inter-year paths the lowest, the latter reflecting increased semantic drift and alignment complexity.
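As a sanity check, the gain formulas from Section 4 reproduce the Qwen3-8B row of the table above:

```python
c_s1, c_s2 = 8.03, 5.77          # correctness under S1 and S2
t_s1, t_s2 = 2_069, 13_601       # input tokens under S1 and S2

delta_correct = (c_s1 - c_s2) / c_s2       # ≈ 0.392
delta_tokens = (t_s2 - t_s1) / t_s2        # ≈ 0.848

print(f"{delta_correct:.1%}, {delta_tokens:.1%}")   # 39.2%, 84.8%
```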
6. Impact and Prospects for Multi-Hop Reasoning and Financial AI
FinReflectKG-MultiHop exemplifies rigorous benchmarking for multi-hop financial QA by exploiting knowledge graphs whose auditability, temporal indexing, and fine-grained provenance support efficient chaining of facts. Structured KG-guided retrieval boosts model correctness and decreases input cost (token usage), with relatively small models benefitting more significantly from precise evidence—underscoring the dominance of retrieval quality over pure model scale.
Suggested extensions include expansion to more cross-company and longer-range temporal inquiries, integrating closed-source LLMs as additional evaluators, broader expert audit coverage, and development of hybrid retrievers blending KG and semantic paradigms.
A curated 555 QA-pair benchmark with full provenance is released for community use at https://anonymous.4open.science/r/finreflectkg-multihopqa-BD45/, providing a domain-tailored standard for multi-hop QA, retrieval, and KG-driven reasoning in finance (Arun et al., 3 Oct 2025).
7. Reflection-Driven Extraction and Multi-Hop Discovery
The reflection-agent mode within FinReflectKG demonstrably enhances multi-hop edge formation, surfacing composite relationships during initial extraction. In comparative evaluation (Arun et al., 25 Aug 2025), multi-hop path-oriented metrics favor reflection agents:
| Metric | Multi-Pass | Reflection-Agent |
|---|---|---|
| 2-hop Path Precision | 58.2% | 68.7% |
| 2-hop Path Recall | 52.5% | 62.3% |
| 3-hop Path Precision | 47.3% | 59.3% |
| 3-hop Path Recall | 40.1% | 52.7% |
Reflection feedback enables the critic-corrector agent to synthesize missing intermediate hops and propose new composite edges using graph context, raising 2-hop and 3-hop precisions by 10.5 and 12.0 points, respectively. The formal iterative algorithm, concrete multi-hop SEC 10-K extraction examples, and metric improvements underscore FinReflectKG’s capability for agentic, schema-guided multi-hop link formation directly during extraction, enriching the KG and supporting downstream multi-hop QA.
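The path-level metrics in the table reduce to set overlap between extracted and gold composite paths; a minimal sketch, assuming exact-match path comparison (the paper's matching criterion may be softer):

```python
def path_precision_recall(extracted: set, gold: set):
    """Each element is a composite path, e.g. (head, r1, mid, r2, tail)."""
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```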