EDINET-Bench: Japanese Financial Analytics Benchmark
- EDINET-Bench is an open-source benchmark that uses authentic Japanese EDINET filings to assess LLM performance on complex financial tasks.
- It automates data parsing and labeling from over 40,000 reports, enabling reproducible evaluation for tasks like fraud detection, earnings forecasting, and industry classification.
- Benchmark results indicate LLMs slightly outperform traditional models in fraud detection while highlighting challenges in nuanced tasks such as earnings prediction.
EDINET-Bench is an open-source Japanese financial benchmark designed to evaluate LLMs on advanced financial analysis tasks using authentic annual reports filed through Japan’s Electronic Disclosure for Investors' NETwork (EDINET). It is the first benchmark of its kind to focus on societally relevant, expert-level decision problems in the Japanese market, giving researchers and practitioners the datasets and tooling needed for comprehensive, reproducible evaluation of modern LLMs in financial analytics.
1. Purpose and Scope
EDINET-Bench addresses the scarcity of challenging financial benchmarks, especially for Japanese corporate data. The benchmark targets LLM evaluation across financial tasks that demand sophisticated reasoning rather than simple extraction or question answering. Principal tasks include:
- Accounting Fraud Detection: Classifying annual reports as fraudulent or non-fraudulent.
- Earnings Forecasting: Predicting if a company’s profits will rise or fall in the subsequent fiscal year.
- Industry Prediction: Assigning a company to its industry segment based only on its report.
In contrast to extractive or shallow classification datasets, EDINET-Bench emphasizes evaluative, expert-level decision-making over real, heterogeneous financial documents.
2. Dataset Construction
Data Sources and Coverage
The EDINET-Bench dataset utilizes more than 40,000 annual and amended reports from approximately 4,000 publicly listed Japanese companies, spanning April 2014 to April 2025. All filings are sourced from the official EDINET system, which provides both structured (TSV) and unstructured (PDF) versions as well as metadata.
Data Processing and Parsing
A dedicated pipeline—edinet2dataset—was developed to automate:
- Download and parsing of filings,
- Extraction of structured components (metadata, summary, balance sheet [BS], profit and loss [PL], cash flow [CF], and textual sections),
- Standardization using the Polars library for robust tabular handling.
This infrastructure supports continuous, automated benchmark expansion as new filings are posted.
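As a rough illustration, the sketch below shows how a downloaded EDINET TSV export might be loaded and standardized with Polars. The file path and column names are hypothetical placeholders and are not taken from edinet2dataset's actual schema.

```python
# Minimal sketch: load one EDINET TSV export and standardize a few fields with Polars.
# Path and column names below are hypothetical, not the edinet2dataset schema.
import polars as pl

def load_filing(tsv_path: str) -> pl.DataFrame:
    # Structured exports are tab-separated; read everything as strings first
    # so that mixed or unparseable cells do not break ingestion.
    raw = pl.read_csv(tsv_path, separator="\t", infer_schema_length=0)

    # Keep the element identifier, context (fiscal period), and value columns,
    # then cast values to floats where possible.
    return (
        raw.select(["element_id", "context_ref", "value"])  # hypothetical column names
           .with_columns(pl.col("value").cast(pl.Float64, strict=False).alias("value_num"))
    )

# Example: extract the balance-sheet subset of one filing.
# df = load_filing("filings/S100XXXX.tsv")
# bs = df.filter(pl.col("element_id").str.contains("BalanceSheet"))
```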
Task-Specific Labeling Pipelines
- Fraud Detection: Fraudulent reports are flagged by analyzing the stated reasons for amendment in the official PDFs, with an LLM (Claude 3.7 Sonnet) applying a rule-based prompt to make the call. Negative samples are drawn from companies with no known fraud history.
- Earnings Forecasting: Pairs of consecutive annual reports from ~1,000 randomly chosen companies are compared; a year-on-year profit increase is labeled “increase,” otherwise “decrease” (see the sketch below).
- Industry Prediction: Each company is mapped to its industry segment using a reduced schema (16 categories consolidated from the original TOPIX-33), with samples balanced across classes.
All labeling and extraction are automated, enabling reproducibility and expansion.
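For concreteness, here is a minimal sketch of the year-on-year labeling rule for earnings forecasting. The column names (`company_id`, `fiscal_year`, `net_income`) are hypothetical, not the benchmark's actual schema.

```python
# Minimal sketch of the earnings-forecasting labeling rule: compare a company's
# profit across consecutive fiscal years and label the later year "increase"
# or "decrease". Column names are hypothetical placeholders.
import polars as pl

def label_earnings(profits: pl.DataFrame) -> pl.DataFrame:
    # profits: one row per (company_id, fiscal_year) with a "net_income" column.
    return (
        profits.sort(["company_id", "fiscal_year"])
               .with_columns(
                   pl.col("net_income").shift(1).over("company_id").alias("prev_income")
               )
               .drop_nulls("prev_income")  # first year of each company has no prior report
               .with_columns(
                   pl.when(pl.col("net_income") > pl.col("prev_income"))
                     .then(pl.lit("increase"))
                     .otherwise(pl.lit("decrease"))
                     .alias("label")
               )
    )
```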
| Task | Train/Test Size | Labeling Strategy |
|---|---|---|
| Fraud | 865 train / 224 test | LLM-based reason analysis + negative sampling |
| Earnings | 549 train / 451 test | Year-on-year profit delta |
| Industry | ~35 per class (no split) | EDINET metadata, class aggregation |
3. Evaluation Metrics and Results
Metrics
- ROC-AUC for the binary tasks (fraud detection, earnings forecasting),
- Matthews correlation coefficient (MCC) for the same binary tasks, which remains informative under class imbalance,
- Accuracy for industry (multi-class) prediction.
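For reference, given a binary confusion matrix with true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the MCC is:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```

MCC ranges from -1 to +1, with 0 corresponding to chance-level prediction, which makes it a useful complement to ROC-AUC on the imbalanced fraud split.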
Results
| Model | Fraud ROC-AUC | Fraud MCC | Earnings ROC-AUC | Earnings MCC | Industry Accuracy |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet (all info) | 0.73 ± 0.02 | 0.32 ± 0.02 | 0.52 ± 0.02 | 0.08 ± 0.02 | 0.41 (BS+CF+PL) |
| Logistic Regression | 0.68 | 0.17 | 0.56 | 0.05 | -- |
Key findings: State-of-the-art LLMs marginally outperform logistic regression for fraud detection but not for earnings forecasts. Structured financial indicators alone allow logistic regression to achieve comparable results, highlighting LLMs’ current limitations in capturing deeper, complex patterns from full reports. For industry prediction, LLMs successfully leverage structured data, performing well above random baselines.
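A minimal sketch of such a logistic-regression baseline on structured indicators, evaluated with ROC-AUC and MCC, might look as follows. The feature columns and file paths are hypothetical; this is not the benchmark's reference implementation.

```python
# Minimal sketch: logistic-regression fraud baseline on structured financial
# indicators, evaluated with ROC-AUC and MCC. Feature names and paths are hypothetical.
import polars as pl
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, matthews_corrcoef
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["total_assets", "net_sales", "operating_income", "cash_flow_ops"]  # hypothetical

train = pl.read_csv("fraud_train.csv")  # hypothetical export of the train split
test = pl.read_csv("fraud_test.csv")    # hypothetical export of the test split

X_train, y_train = train.select(FEATURES).to_numpy(), train["label"].to_numpy()
X_test, y_test = test.select(FEATURES).to_numpy(), test["label"].to_numpy()

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]   # predicted fraud probability
pred = (proba >= 0.5).astype(int)         # hard labels for MCC

print("ROC-AUC:", roc_auc_score(y_test, proba))
print("MCC:    ", matthews_corrcoef(y_test, pred))
```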
4. Technical Challenges and Insights
LLMs face substantial difficulties on these financial decision tasks:
- Intrinsic Task Complexity: Both fraud detection and earnings forecasting require reasoning over subtle patterns and temporal context, which is challenging from a single annual report.
- Textual and Tabular Heterogeneity: LLMs benefit from unstructured textual sections for fraud detection but not for earnings prediction, while structured tabular sections are an effective signal for industry classification.
- Interpretability and Reasoning: LLMs sometimes fall back on naive heuristics (e.g., associating company size with fraud likelihood), indicating incomplete exploitation of nuanced textual cues.
- Label and Parsing Noise: Relying on an LLM to assign fraud labels from amendment reasons, together with automatic parsing, may introduce labeling noise and inconsistencies.
- Potential Training Set Contamination: Because the corpus consists of public filings, reports may appear in LLM pretraining data; temporal splits and the task design are intended to minimize the effect of such contamination on evaluation.
5. Design Principles and Automation
EDINET-Bench reflects key infrastructural innovations:
- Automation: The data collection, parsing, and labeling pipelines are fully automated, supporting continuous, up-to-date expansion as new EDINET reports become available.
- Modular Tooling: The edinet2dataset tool modularizes download, parsing, and conversion, supporting flexible research and reproducibility.
- Open Source and Accessibility: Complete datasets and associated code for both construction and task evaluation are published to facilitate future research.
- Dataset: https://huggingface.co/datasets/SakanaAI/EDINET-Bench
- Data construction tool: https://github.com/SakanaAI/edinet2dataset
- Benchmark code: https://github.com/SakanaAI/EDINET-Bench
6. Impact, Limitations, and Future Directions
EDINET-Bench sets a new standard for financial analytics benchmarking by focusing on Japanese filings and decision-level evaluation with LLMs. It demonstrates that even the most advanced LLMs only slightly outperform traditional models in binary classification tasks, underscoring the need for:
- Domain-Specific Adaptation: General-purpose LLMs remain insufficient for complex financial analytics. Advances in pretraining or fine-tuning on domain data are necessary for further progress.
- Expanded Context and Multimodal Reasoning: Incorporating external knowledge bases, temporal trends, or multi-document reasoning may improve predictive accuracy, particularly for fraud detection.
- Enhanced Evaluation Paradigms: The benchmark motivates the use of rubric-based and agentic evaluation pipelines to more accurately reflect real-world analytic settings.
- Adaptive, Automated Benchmarking: The infrastructure allows for continuous, real-world performance tracking and research scalability.
7. Significance Within the Broader Research Landscape
EDINET-Bench provides the first scalable, fully-automated, challenging benchmark for Japanese corporate filings, exposing fundamental gaps in current LLM capabilities for critical financial decision-making. It serves as a reproducible, extensible platform catalyzing research on domain adaptation, multimodal analytic AI, and advanced benchmarking methodologies in finance. Its design principles and methodologies offer a template for analogous efforts in other regions or financial domains.