FinReflectKG – EvalBench: Financial KG Benchmark
- The framework combines agentic linking of every extracted KG triple to its specific source text with a deterministic commit-then-justify judging protocol for enhanced faithfulness.
- It employs three extraction modes—single-pass, multi-pass, and reflection-agent-based—to balance trade-offs between precision, comprehensiveness, and textual adherence.
- The benchmark aggregates bias-controlled metrics and supports structured error analysis, advancing transparent, reliable financial AI applications.
FinReflectKG – EvalBench is a multi-dimensional evaluation framework and benchmark for financial knowledge graph (KG) extraction from SEC 10-K filings. It is tailored to rigorously assess and compare structured triple-extraction methods, with explicit bias controls and agentic linking of triples to source text. Through a deterministic commit-then-justify protocol and comprehensive metric aggregation, EvalBench provides a reproducible, fine-grained, and bias-aware methodology, advancing transparency, coverage, and factuality standards in financial AI applications.
1. Design Principles and Architecture
FinReflectKG – EvalBench is founded on agentic extraction principles, rigorously linking every KG triple to its specific source chunk in SEC 10-K filings. The framework supports three distinct extraction modes: single-pass (direct, one-shot extraction and normalization); multi-pass (initial extraction followed by normalization in a separate step); and reflection-agent-based (iterative correction with feedback loops).
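The three modes can be summarized with a minimal Python sketch; the `llm` object and its methods (`extract`, `normalize`, `extract_and_normalize`, `critique`, `revise`) are hypothetical stand-ins, not the framework's actual API:

```python
# Minimal sketch of the three extraction modes; the llm interface is assumed.

def single_pass(chunk, llm):
    # Direct one-shot extraction and normalization in a single call.
    return llm.extract_and_normalize(chunk)

def multi_pass(chunk, llm):
    # Stage 1: raw extraction; stage 2: normalization as a separate step.
    raw = llm.extract(chunk)
    return llm.normalize(raw)

def reflection_agent(chunk, llm, max_rounds=3):
    # Iterative correction with a feedback loop: critique the current
    # triples and revise until the critique is clean or the budget is spent.
    triples = llm.extract_and_normalize(chunk)
    for _ in range(max_rounds):
        feedback = llm.critique(chunk, triples)
        if feedback.is_clean:  # hypothetical flag on the critique result
            break
        triples = llm.revise(chunk, triples, feedback)
    return triples
```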
Evaluation is structured across multiple dimensions using a deterministic commit-then-justify protocol, where an LLM-as-Judge is configured at temperature 0.0, first issuing binary verdicts (yes/no) per dimension, then providing concise, bounded-length justifications. The architecture is designed to eliminate the influence of ambiguity, external/world knowledge, ordering effects, and verbosity on verdicts by embedding explicit bias-mitigation controls throughout both extraction and evaluation stages.
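A hedged sketch of one commit-then-justify judging call follows. The `client` object and prompt wording are illustrative placeholders; temperature 0.0, the verdict-first format, the bounded-length justification, and the default-negative rule follow the protocol described above:

```python
# Sketch of a single deterministic commit-then-justify judging step.
JUDGE_PROMPT = """You are judging one dimension: {dimension}.

Source chunk:
{chunk}

Candidate triple: {triple}

First line: answer strictly "yes" or "no".
Second line: a justification of at most {max_words} words.
If the evidence is ambiguous, answer "no"."""

def judge(client, dimension, chunk, triple, max_words=40):
    reply = client.complete(  # hypothetical LLM client
        prompt=JUDGE_PROMPT.format(dimension=dimension, chunk=chunk,
                                   triple=triple, max_words=max_words),
        temperature=0.0,  # deterministic verdicts
    )
    verdict_line, _, justification = reply.partition("\n")
    return verdict_line.strip().lower() == "yes", justification.strip()
```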
2. Multi-Dimensional Judging Protocol and Bias Controls
Central to EvalBench is the judging protocol, which encompasses the following dimensions:
- Faithfulness evaluates whether the triple is directly and factually grounded in the source chunk, without relying on inferential or background knowledge.
- Precision penalizes generic or underspecified entities, requiring triples to reflect clearly demarcated sets and relationships as stated in the text.
- Relevance ensures all extracted triples are thematically pertinent to their corresponding source chunk, avoiding tangential or unrelated facts.
- Comprehensiveness is scored at the chunk level on a three-level ordinal scale (“good”, “partial”, “bad”), reflecting the coverage of atomic facts.
Bias controls are rigorously enforced: leniency bias is mitigated by defaulting to the negative verdict under ambiguity; world-knowledge bias is controlled by limiting judgments strictly to the evidence in the prompt; position bias is addressed by instructing judges to ignore sentence order; and verbosity bias by disregarding surface-form variations that do not affect semantic accuracy. Calibrating the judge with few-shot examples further improves reliability and consistency.
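As an illustration, the four bias controls and few-shot calibration might be encoded in the judge's instructions roughly as follows; the wording and examples are invented for this sketch and are not taken from the paper:

```python
# Illustrative encoding of the four bias controls as judge instructions.
BIAS_CONTROLS = """\
- Leniency: if the evidence is ambiguous or incomplete, answer "no".
- World knowledge: judge only from the provided chunk; ignore anything
  you may know about the company from elsewhere.
- Position: the order of sentences within the chunk carries no weight.
- Verbosity: surface-form or length differences that do not change the
  triple's meaning must not affect your verdict."""

FEW_SHOT = [
    # (dimension, chunk, triple, expected verdict) -- hypothetical examples
    ("faithfulness",
     "Revenue increased 12% to $4.1 billion in fiscal 2023.",
     ("revenue", "increased_by", "12%"),
     "yes"),
    ("precision",
     "The company operates in several markets.",
     ("company", "operates_in", "several markets"),
     "no"),  # generic, underspecified entities are penalized
]
```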
3. Performance Metrics and Aggregation
Each candidate triple is assessed independently for faithfulness, precision, and relevance with a binary indicator (1 if the dimension is satisfied, 0 otherwise). These binary metrics are micro-averaged over all triples in the corpus, while comprehensiveness is macro-averaged across all chunks. Formally, letting $\mathcal{T}$ denote the set of extracted triples and $y_F(t) \in \{0, 1\}$ the faithfulness verdict for triple $t$, binary metrics are computed as

$$F = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} y_F(t),$$

with analogous equations for precision ($P$) and relevance ($R$). Comprehensiveness is aggregated as

$$\text{Comprehensiveness} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} s\big(E(c)\big),$$

where $\mathcal{C}$ is the set of source chunks, $E(c)$ is the extraction mapping for each span $c$, and $s(\cdot)$ maps the chunk-level label (“good”, “partial”, “bad”) to an ordinal score.
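A minimal sketch of this aggregation scheme in Python, assuming a hypothetical 1.0/0.5/0.0 mapping for the ordinal comprehensiveness labels (the paper's exact scoring values may differ):

```python
# Micro-average binary verdicts over all triples; macro-average
# comprehensiveness over chunks. The ORDINAL mapping is an assumption.
ORDINAL = {"good": 1.0, "partial": 0.5, "bad": 0.0}

def micro_average(verdicts):
    # verdicts: one 0/1 indicator per triple across the whole corpus
    return sum(verdicts) / len(verdicts)

def macro_comprehensiveness(chunk_labels):
    # chunk_labels: one "good"/"partial"/"bad" label per chunk
    return sum(ORDINAL[label] for label in chunk_labels) / len(chunk_labels)
```

For example, `micro_average([1, 1, 0, 1])` yields 0.75, and `macro_comprehensiveness(["good", "partial", "bad"])` yields 0.5.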
4. Comparative Analysis of Extraction Modes
EvalBench supports three modes, each exhibiting distinct performance trade-offs:
- Single-Pass Extraction: Highest faithfulness (87.25), yielding triples closely tied to the text, but often less comprehensive.
- Multi-Pass Extraction: Balances faithfulness and precision via staged normalization but does not notably surpass single-pass in aggregate comprehensiveness.
- Reflection-Agent-Based Extraction: Excels in comprehensiveness (72.01), precision (59.49), and relevance (92.52), employing iterative feedback to correct and expand extraction. Reflection comes with a slight decrease in faithfulness compared to single-pass, highlighting a trade-off between breadth and textual grounding.
The iterative, agentic reflection workflow is particularly effective for expanding coverage and improving dimensional performance, especially in capturing diverse and interrelated facts from complex financial narratives.
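Tying the earlier sketches together, a hypothetical end-to-end comparison over a shared set of chunks could look like this (reusing the mode functions, `judge`, and `micro_average` defined above; chunk-level comprehensiveness is omitted for brevity):

```python
# Run each extraction mode over the same chunks and micro-average the
# three binary dimensions, mirroring the trade-off analysis above.
def benchmark(chunks, llm, client):
    results = {}
    for name, mode in [("single_pass", single_pass),
                       ("multi_pass", multi_pass),
                       ("reflection", reflection_agent)]:
        verdicts = {"faithfulness": [], "precision": [], "relevance": []}
        for chunk in chunks:
            for triple in mode(chunk, llm):
                for dim, vs in verdicts.items():
                    ok, _ = judge(client, dim, chunk, triple)
                    vs.append(int(ok))
        results[name] = {d: micro_average(v) for d, v in verdicts.items()}
    return results
```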
5. Structured Error Analysis and Reliability
The explicit dimensional partitioning enables structured error analysis and facilitates tracing systematic weaknesses in extraction protocols. When equipped with explicit bias controls, LLM-as-Judge protocols emerge as reliable, cost-efficient substitutes for human annotation, providing granular visibility into extraction errors and logical inconsistencies at scale.
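As a sketch, dimension-partitioned error analysis reduces to counting failed verdicts per (mode, dimension) pair; the record fields below are illustrative, not the framework's schema:

```python
from collections import Counter

def error_profile(records):
    # records: iterable of dicts such as
    # {"mode": "multi_pass", "dimension": "precision", "verdict": 0}
    # Returns failure counts keyed by (mode, dimension), so systematic
    # weaknesses of an extraction protocol surface directly as counts.
    return Counter(
        (r["mode"], r["dimension"]) for r in records if r["verdict"] == 0
    )
```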
Reflection-based extraction achieves broader semantic coverage and mitigates hallucination-related precision and relevance errors, while single-pass preserves strict adherence to the source text. This suggests that metric prioritization should be application-dependent: risk management might demand maximal faithfulness, while market intelligence may prioritize comprehensiveness and relevance.
6. Implications for Financial AI and Future Directions
By aggregating complementary metric dimensions, EvalBench enables nuanced benchmarking and bias-aware evaluation that advances transparency and governance in financial AI applications. The trade-offs uncovered between coverage and fidelity inform model selection and pipeline configuration for applications in risk analysis, compliance monitoring, investment research, and other high-stakes domains.
Future extensions are anticipated along several axes: expanding the document base beyond S&P 100 filings, integrating more dynamic feedback loops capable of driving self-improvement in extraction and model training, benchmarking against human annotation, and introducing new metric dimensions—such as temporal semantic consistency—to further advance KG evaluation.
7. Summary and Significance
FinReflectKG – EvalBench establishes a unified, reproducible, and richly controlled framework for evaluating knowledge graph extraction in the financial domain (Dimino et al., 7 Oct 2025). The deterministic multi-dimensional judging protocol—with explicit bias controls and agentic document linking—sets a rigorous standard for both extraction quality and evaluation methodology. Reflection-agent-based extraction is demonstrated to be superior for comprehensiveness, precision, and relevance, while single-pass remains optimal for faithful textual grounding. The nuanced metric aggregation and error analysis capabilities position EvalBench as a foundational instrument for transparent, bias-aware development of financial knowledge graphs, supporting advanced downstream reasoning and question answering scenarios in financial AI.