
FinReflectKG – MultiHop Benchmark

Updated 25 November 2025
  • FinReflectKG – MultiHop is a benchmark that advances multi-hop QA by leveraging a temporally-indexed financial knowledge graph extracted from S&P 100 filings.
  • It employs LLM-prompted subgraph pattern mining and a rigorous two-phase pipeline to generate and validate evidence-grounded QA pairs.
  • The benchmark demonstrates significant improvements in correctness and resource efficiency, supporting scalable, provenance-aware financial question answering.

FinReflectKG – MultiHop is a benchmark and methodology suite for evaluating multi-hop question answering (QA) over financial disclosures, leveraging a temporally indexed knowledge graph (KG) constructed from S&P 100 companies’ SEC 10-K filings for 2022–2024. The resource addresses the central challenge in financial QA: relevant facts are dispersed across filings, sections, temporal spans, and companies, and LLMs often fail to retrieve or connect the correct context for robust multi-hop reasoning. FinReflectKG – MultiHop systematically evaluates retrieval and reasoning under controlled evidence regimes and establishes a path for scalable, provenance-aware financial question answering grounded in knowledge graphs (Arun et al., 3 Oct 2025).

1. Underlying Knowledge Graph: Temporally Indexed FinReflectKG

FinReflectKG – MultiHop is built upon the FinReflectKG, a source-attributed, temporally indexed financial knowledge graph extracted from S&P 100 10-K filings for 2022–2024. Each triple in the KG has the following properties:

  • Provenance Attribution: Every RDF-style triple (⟨subject, predicate, object⟩) is linked to a precise document chunk (file, page, chunk ID) in the underlying filing, ensuring every fact is auditable.
  • Temporal Indexing: Each triple is timestamped to correspond with the specific annual filing, supporting temporal queries such as event or metric changes across reporting years.
  • Representative Subgraph: While the main FinReflectKG spans 17.5M triples from 743 S&P 500 companies over 2014–2024, the MultiHop benchmark focuses on a concise, high-density subgraph (S&P 100, 2022–2024) to facilitate controlled, evaluable multi-hop QA (Arun et al., 3 Oct 2025).

This design enables downstream tasks to constrain retrieval to the disclosures available at a specific point in time and to analyze how relationships evolve within and across reporting years and across firms.
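Concretely, a provenance-attributed, temporally indexed triple can be modeled as follows. This is a minimal Python sketch; the class names, fields, and the sample filing are illustrative, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    """Pointer to the exact document chunk a triple was extracted from."""
    file: str       # source filing, e.g. a 10-K document
    page: int
    chunk_id: str

@dataclass(frozen=True)
class Triple:
    """RDF-style fact with source attribution and a filing-year timestamp."""
    subject: str
    predicate: str
    obj: str
    filing_year: int
    provenance: Provenance

def triples_as_of(kg: list[Triple], year: int) -> list[Triple]:
    """Constrain retrieval to disclosures available up to a reporting year."""
    return [t for t in kg if t.filing_year <= year]

# Hypothetical facts from two annual filings
kg = [
    Triple("Company X", "Discloses", "Raw Material Y", 2023,
           Provenance("companyx_10k_2023.html", 42, "c-0042-03")),
    Triple("Company X", "Discloses", "Risk Z", 2024,
           Provenance("companyx_10k_2024.html", 17, "c-0017-01")),
]
assert len(triples_as_of(kg, 2023)) == 1  # the 2024 fact is excluded
```

Freezing the dataclasses keeps each fact immutable and hashable, which mirrors the auditable, append-only character of source-attributed KG triples.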

2. Multi-Hop Subgraph Pattern Mining and Question Generation

To mirror real-world financial analytical queries, the benchmark mines frequent 2- and 3-hop subgraph patterns, stratified by GICS sector taxonomy:

  • Pattern Generation via LLM Prompting: A high-capacity LLM (Qwen3-235B) is prompted using a schema of 24 entity types (e.g., ORG, FIN_METRIC, RISK_FACTOR) and 29 relation types (e.g., Discloses, Depends_On, Positively_Impacts). Prompts emphasize financial relevance, analytical value, and Cypher query compatibility.
    • Each generated subgraph template is evaluated on four criteria: correct terminology, multi-hop complexity, uniqueness, and validity (total score ≥8/10 for inclusion).
  • Graph-Based Filtering: Validated patterns are instantiated over the KG, with candidate evidence chunks scored by pattern density D(P, c) (the number of matched triples per chunk) and ranked by betweenness centrality BC(c) to ensure chunks serve as bridging nodes in multi-hop chains. The chunks with the highest density and centrality seed question generation.

The resulting subgraphs underpin the generation of analyst-style questions, capturing sector-specific reasoning structures and grounding each query in precise, temporally anchored KG evidence (Arun et al., 3 Oct 2025).
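The chunk-seeding step described above can be sketched with networkx, assuming an undirected chunk graph whose edges link chunks that share entities; the function and variable names are hypothetical:

```python
import networkx as nx

def rank_seed_chunks(pattern_density, chunk_graph):
    """
    pattern_density: dict chunk_id -> number of pattern triples matched in
      that chunk, i.e. the pattern density D(P, c).
    chunk_graph: nx.Graph linking chunks that share entities; betweenness
      centrality BC(c) measures how often a chunk bridges multi-hop chains.
    Returns chunk ids sorted by (density, centrality), best first.
    """
    bc = nx.betweenness_centrality(chunk_graph)
    return sorted(pattern_density,
                  key=lambda c: (pattern_density[c], bc.get(c, 0.0)),
                  reverse=True)

# Toy example: chunk "b" bridges "a" and "c" in the chunk graph
g = nx.Graph([("a", "b"), ("b", "c")])
density = {"a": 2, "b": 2, "c": 1}
ranking = rank_seed_chunks(density, g)
# "b" outranks "a": equal density, but higher betweenness centrality
```

Sorting on the (density, centrality) pair makes density the primary criterion and centrality the tie-breaker; the source does not specify how the two scores are combined, so this is one plausible reading.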

3. QA Pair Generation and Quality Assurance Pipeline

FinReflectKG – MultiHop employs a rigorous two-phase pipeline for QA pair creation:

  • Phase 1: Pattern-Specific Prompting. Each mined KG path is transformed into a conversational, temporally explicit financial question (e.g., "In its 2023 report, which raw material does Company X disclose as being impacted by Event Y?"). The prompt extracts both the question and the answer, which is drawn verbatim from the supporting KG evidence.
  • Phase 2: Multi-Criteria Quality Control. Each QA pair is scored on five axes (analyst-likeness, multi-hop fidelity, evidence grounding, relevance, and domain expertise), each up to 10 points. Only pairs with a total score of at least 40 out of 50 (∑ᵢ sᵢ ≥ 40) are retained, enforcing stringent validity and relevance requirements.

A curated set of 555 expert-validated QA pairs is released, facilitating robust experimental evaluation and supporting further research in financial QA (Arun et al., 3 Oct 2025).
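Under the stated thresholds, the Phase 2 gate reduces to a simple filter. A sketch, where the axis identifiers paraphrase the five criteria:

```python
AXES = ("analyst_likeness", "multi_hop_fidelity",
        "evidence_grounding", "relevance", "domain_expertise")

def passes_quality_gate(scores: dict, threshold: int = 40) -> bool:
    """Phase-2 filter: each axis is scored 0-10; a QA pair is retained
    only if the five scores sum to at least 40 of a possible 50."""
    assert set(scores) == set(AXES)
    assert all(0 <= scores[a] <= 10 for a in AXES)
    return sum(scores[a] for a in AXES) >= threshold

# Averaging 8/10 across axes passes; one weak axis can sink a pair
good = dict(zip(AXES, (9, 8, 8, 8, 8)))   # total 41 -> retained
weak = dict(zip(AXES, (9, 9, 4, 9, 8)))   # total 39 -> discarded
```

Note that the aggregate threshold implicitly demands an average of 8/10, so a pair cannot compensate for a very low evidence-grounding score with perfect scores elsewhere.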

4. Controlled Evaluation Scenarios and Experimental Design

The benchmark evaluates retrieval and reasoning under three evidence settings, simulating practical deployment scenarios:

| Scenario | Evidence Provided | Purpose |
| --- | --- | --- |
| S1: KG-linked | Only the exact interconnected chunks identified by the KG path | Minimal context, maximal evidence precision |
| S2: Page-window | ±5 pages around relevant chunks (deduplicated, source-tagged) | High-precision vector-retrieval approximation |
| S3: Windows + Distractors | Same as S2, plus randomly selected irrelevant/semantically similar chunks | Noisy, low-precision retrieval simulation |

Models evaluated include OpenAI GPT-OSS (20B/120B) and Qwen3 (8B/32B), in reasoning (chain-of-thought) and non-reasoning configurations. Metrics tracked comprise:

  • Correctness: LLM-as-a-Judge (Qwen3-235B or Gemini 2.5 Pro) rates model outputs (0–10).
  • Semantic Fidelity: BERTScore F1 computed using microsoft/deberta-xlarge-mnli.
  • Resource Utilization: Input and completion tokens, reflecting efficiency and context window usage (Arun et al., 3 Oct 2025).
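A minimal sketch of how the three evidence regimes could be assembled for one question. The function, the page-window details, and the data shapes are assumptions for illustration, not the released evaluation harness:

```python
import random

def build_evidence(kg_chunks, filing_pages, distractors,
                   scenario, window=5, seed=0):
    """
    Assemble the evidence context for one question under three regimes:
      S1: only the KG-linked chunks on the reasoning path;
      S2: a +/-5-page window around each linked chunk, deduplicated;
      S3: S2 plus distractor chunks simulating noisy retrieval.
    kg_chunks: list of (page, chunk_id) pairs on the KG path.
    filing_pages: dict page -> list of chunk ids on that page.
    """
    if scenario == "S1":
        return [cid for _, cid in kg_chunks]
    context, seen = [], set()
    for page, _ in kg_chunks:
        for p in range(page - window, page + window + 1):
            for cid in filing_pages.get(p, []):
                if cid not in seen:
                    seen.add(cid)
                    context.append(cid)
    if scenario == "S3":
        rng = random.Random(seed)  # fixed seed for reproducible noise
        context += rng.sample(distractors, k=min(3, len(distractors)))
    return context

# Toy filing: one KG-linked chunk on page 10, a far-away chunk on page 30
pages = {8: ["c08"], 10: ["c10"], 12: ["c12"], 30: ["c30"]}
linked = [(10, "c10")]
s1 = build_evidence(linked, pages, ["d1", "d2"], "S1")
s2 = build_evidence(linked, pages, ["d1", "d2"], "S2")
s3 = build_evidence(linked, pages, ["d1", "d2"], "S3")
```

In this toy run, S1 contains only the linked chunk, S2 pulls in the nearby pages but not page 30, and S3 inflates the context further with distractors, mirroring the precision gradient the benchmark is designed to measure.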

5. Quantitative Results and Performance Analysis

Across models and evidence regimes, KG-linked retrieval (S1) delivers substantial performance gains:

  • Correctness: KG-linked evidence boosts scores by approximately 24% over page-window context (e.g., Qwen3-32B achieves 8.23 vs. 6.59; +24.9%).
  • Resource Efficiency: Required input tokens are reduced by ~84.5% on average (Qwen3-32B: 2,069 vs. 13,602; –84.8%).
  • Evidence Relationship Breakdown: Average LLM-Judge correctness for intra-document: 7.47, cross-company: 7.12, inter-year: 6.72.
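The Qwen3-32B deltas quoted above follow directly from the reported raw numbers:

```python
# Correctness gain: KG-linked (S1) vs. page-window (S2) for Qwen3-32B
s1_score, s2_score = 8.23, 6.59
gain = (s1_score - s2_score) / s2_score
print(f"correctness: +{gain:.1%}")     # +24.9%

# Input-token reduction for the same model and comparison
s1_tokens, s2_tokens = 2_069, 13_602
saving = (s2_tokens - s1_tokens) / s2_tokens
print(f"input tokens: -{saving:.1%}")  # -84.8%
```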

These findings corroborate that precise, provenance-linked KG evidence is especially advantageous for questions spanning documents, years, or companies, and that even strong LLMs incur significant degradation with noisy, lengthy input contexts (Arun et al., 3 Oct 2025).

6. Positioning within Multi-Hop KGQA and Relevant Methodologies

FinReflectKG – MultiHop is architecturally distinguished from prior benchmarks by integrating symbolic retrieval with temporal and provenance constraints, while its evaluation design draws on leading approaches from general KGQA and retrieval-augmented generation:

  • Multi-View Retrieval-Augmented Generation: The ParallaxRAG framework (Liu, 17 Oct 2025) demonstrates that multi-hop reasoning can be enhanced via multiple semantic views (attention heads), diversity regularization, and head-wise hop specialization. These techniques, when adapted for financial KGs, facilitate interpretable, stepwise analysis and risk-aware path ranking.
  • Unified Retrieval-Reasoning Architectures: The UniKGQA approach (Jiang et al., 2022) advocates joint retrieval and reasoning using PLM-encoded question–relation pairs and information propagation, which can be instantiated directly for the financial relations and entities in FinReflectKG – MultiHop. Shared pre-training on question–relation contrastive tasks and abstract-graph retrieval could likewise accelerate evidence gathering and reasoning in the financial domain.

These connections underscore that symbolic, provenance-centric retrieval and multi-hop reasoning are converging as critical techniques for high-precision, low-noise question answering in complex domains.

7. Limitations, Ongoing Work, and Research Directions

FinReflectKG – MultiHop's current evaluation is sector-limited (Financials and Tech) and offers a moderate-sized, expert-validated subset. The need for broader GICS sector coverage, scaled manual review for grounding, and increased diversity in LLM judges is acknowledged. Future work aims to expand the benchmark scope, mitigate evaluator bias, and enhance expert auditing, supporting the development of scalable, transparent financial multi-hop reasoning systems (Arun et al., 3 Oct 2025).

The pivotal conclusion is that structured, temporally and provenance-aware KG evidence can both improve multi-hop QA correctness (by ≃ 24%) and dramatically reduce token overhead (by ≃ 84.5%), providing a robust foundation for advanced financial QA pipelines in both academic and practical settings.
