DataTales Financial Reporting Benchmark
- DataTales is an advanced evaluation suite that converts large-scale financial tables into coherent, insight-rich narratives through quantitative and analytical operations.
- It leverages 4,900 curated reports with multi-entity market data to enforce strict factual accuracy and to demand rich qualitative analysis.
- The benchmark employs multi-dimensional evaluation—including factuality, insight, style, and hierarchical frameworks like KAHAN—to assess model performance.
The DataTales Financial Reporting Benchmark is an advanced evaluation suite specifically developed to measure the capabilities of LLMs in “data narration,” a task defined as the transformation of complex, large-scale financial tabular data into accurate, insightful natural language reports. The benchmark builds upon limitations found in prior data-to-text tasks (e.g., RotoWire, WikiBio) by introducing rich analytical operations, strict factuality checks, and domain-specific reporting challenges fundamental to financial applications (Yang et al., 23 Oct 2024).
1. Benchmark Design and Core Objectives
DataTales is constructed to assess data narration that demands not only correct surface-level fact transformation but also multi-step analytical reasoning, nuanced domain terminology, and high-level explanation. The dataset consists of 4,900 curated financial reports, each paired with corresponding multi-entity, multi-day tabular market data, representing a range of asset classes and market sectors.
Distinct from standard data-to-text tasks, each sentence in DataTales exhibits an average of 2.6 analytical operations—including trend identification, causal inference, and market prediction—which models must map from quantitative inputs to coherent textual insights. Models evaluated on DataTales must therefore perform both quantitative computation and qualitative analysis, emulating the reasoning of financial analysts rather than simple template-based text generation.
2. Dataset Properties and Curation
Reports are curated from reputable financial outlets such as Investrade, Totalfarmmarketing, VT Markets, and LeapRate. Each document is paired with corresponding financial tickers and historical numerical data extracted from sources like Yahoo! Finance. The curation process employs sentence-level classification using an in-context learning setup (ChatGPT-based), categorizing sentences into market movement, market context, external events/influences, and prediction/suggestions. Only sentences providing concrete market insights (market movement, predictions) are retained, resulting in a high density of analytical content and a 54.4% reduction of non-informative text.
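The curation filter described above can be sketched as follows. The prompt template, category wording, and `classify` callable are illustrative assumptions about the interface, not the authors' exact implementation:

```python
# Sketch of the sentence-level curation filter: classify each report
# sentence in-context, then keep only concrete market insights.

CATEGORIES = [
    "market movement",
    "market context",
    "external events/influences",
    "prediction/suggestions",
]

# Only these two categories are retained in DataTales.
KEEP = {"market movement", "prediction/suggestions"}

def build_prompt(sentence: str) -> str:
    """In-context classification prompt for one report sentence."""
    labels = ", ".join(CATEGORIES)
    return (
        f"Classify the financial-report sentence into one of: {labels}.\n"
        f"Sentence: {sentence}\nLabel:"
    )

def filter_report(sentences, classify):
    """Keep only sentences carrying concrete market insight.

    `classify` is any callable mapping a prompt string to a category
    label (a ChatGPT API wrapper in the original pipeline).
    """
    return [s for s in sentences if classify(build_prompt(s)) in KEEP]
```

In practice `classify` would wrap an LLM call with few-shot examples; the keyword-free prompt here is only a placeholder.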
Tabular grounding is central: each of the $E$ financial entities over $D$ days is defined by real numerical values $x_{e,d}$. The narration task for a model $f$ is formalized as

$$f : \{x_{e,d}\}_{e=1,\dots,E;\ d=1,\dots,D} \mapsto N,$$

where $E$ and $D$ are the cardinalities of entities and days, and $N$ is the generated narrative.
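This formalization can be made concrete with a minimal data-structure sketch. The field names, ticker/date keys, and CSV-style serialization are illustrative assumptions, not DataTales' actual schema:

```python
# Minimal sketch of the input table T (entity x day grid of real
# values) and the narration mapping f(T) -> N.

from dataclasses import dataclass

@dataclass
class MarketRecord:
    """One day's numeric values for one financial entity."""
    open: float
    high: float
    low: float
    close: float
    volume: int

# T : entity ticker -> ISO date -> record
Table = dict[str, dict[str, MarketRecord]]

def narrate(table: Table, model) -> str:
    """f(T) -> N: serialize the table and ask `model` for a narrative.

    `model` is any callable mapping a serialized table to text
    (an LLM call in the benchmark setting).
    """
    rows = [
        f"{entity},{day},{record.close}"
        for entity, days in table.items()
        for day, record in days.items()
    ]
    return model("\n".join(rows))
```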
3. Evaluation Protocols and Metric Design
Evaluation is multi-dimensional and comprises both automated and human assessments:
- Factual Accuracy: Automated metrics use a Named Entity Recognition system (e.g., Stanza) to extract all numerical entities from generated reports. For each such entity, models are prompted with preceding context and must predict the next correct numeric token. The factuality score is the proportion of correctly predicted values.
- Insightfulness: Expert raters assess the impact and significance of each insight (scale 1–5), tracking the richness of explanations and emphasis on key events.
- Stylistic Measures: BLEU scores and cosine similarity (on verbs and entity usage) compare generated reports to human reference narratives.
- Numerical Continuation: At every numeric token, the model’s next-token prediction is compared to the ground truth, capturing consistency and recall over extended factual chains.
This rigorous protocol tests deep analytical capacity, not only surface linguistic fluency.
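The factuality metric can be sketched as below. A regex stands in for the Stanza NER step, and `predict_number` is an assumed interface for the model's next-numeric-token prediction given preceding context:

```python
import re

def factuality_score(report: str, predict_number) -> float:
    """Fraction of numeric tokens the model reproduces from context.

    For each number in the report, the model sees only the text before
    it and must predict the numeric token; the score is the proportion
    predicted exactly. A regex substitutes for NER-based extraction.
    """
    matches = list(re.finditer(r"-?\d+(?:\.\d+)?", report))
    if not matches:
        return 0.0
    correct = 0
    for m in matches:
        context = report[: m.start()]  # everything preceding the number
        if predict_number(context) == m.group():
            correct += 1
    return correct / len(matches)
```

A real implementation would restrict extraction to NER-tagged numeric entities (prices, percentages) rather than all digit spans, and would compare values after normalization.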
4. Analytical Challenges and Model Performance
Empirical results demonstrate widespread difficulties among leading LLMs:
- Factual accuracy for numeric values remains low (often <30%), with performance degrading as historical context grows longer for each report.
- Larger LLMs (such as GPT-4) excel in factual accuracy but may underperform in depth of insight relative to specialized, fine-tuned models (e.g., Llama2 variants).
- Human evaluators often rate longer, multi-step reports as more insightful, but automated model-based metrics do not always align with these ratings.
Crucially, the benchmark’s emphasis on advanced analytical operations (e.g., causal, predictive reasoning) exposes the inability of current models to sustain coherent multi-hop inference over extended, multi-column financial datasets.
5. Comparative Perspective and Relationship to Prior Benchmarks
DataTales distinguishes itself from prior benchmarks along several dimensions:

| Dimension | RotoWire/WikiBio | DataTales |
|------------------|---------------------|-------------------------------|
| Input size | Small tables | Large, multi-entity inputs |
| Analytical depth | Surface-level facts | Average 2.6 ops per sentence |
| Domain coverage | Sports, biographies | Financial/market reporting |
| Evaluation | BLEU, ROUGE | Factuality, insight, style |
Unlike extreme multi-label classification benchmarks (e.g., FNXL (Sharma et al., 2023), FinTagging (Wang et al., 27 May 2025)), DataTales demands full narrative synthesis; unlike QA-focused benchmarks (e.g., SECQUE (Yoash et al., 6 Apr 2025), FinanceQA (Mateega et al., 30 Jan 2025)), it evaluates the entire process from data lookup to deep explanation. The structure and multi-hop reasoning requirements of DataTales are also independent of graph-based transaction benchmarks (FinBench (Qi et al., 2023)), which primarily measure database query processing.
6. Hierarchical Approaches and the KAHAN Framework
Recent advances leveraging DataTales include hierarchical analysis and narration strategies, notably the KAHAN framework (Yang et al., 21 Sep 2025). KAHAN organizes analysis into entity-level, pairwise, group, and system layers, synthesizing insights before narrative generation:
- Entity-level analysis poses analytical questions, runs code to extract metrics (e.g., volatility, moving averages), and interprets these quantities.
- Pairwise, group, and system-level synthesis clusters and contextualizes insights, capturing collective market behavior such as sector rotation. This multi-level approach is formally summarized as:

$$N = g\big(I_{\text{entity}}, I_{\text{pair}}, I_{\text{group}}, I_{\text{system}}\big),$$

where the intermediate insights $I_{(\cdot)}$ contribute to the final market-wide analysis $N$.
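The entity-level metrics mentioned above can be sketched with standard definitions; the exact metric set and windowing KAHAN uses are not specified here, so these are common-default assumptions (sample standard deviation of simple daily returns, trailing simple moving average):

```python
import statistics

def daily_returns(closes):
    """Simple day-over-day returns from a close-price series."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]

def volatility(closes):
    """Sample standard deviation of daily returns."""
    return statistics.stdev(daily_returns(closes))

def moving_average(closes, window):
    """Trailing simple moving average over `window` days."""
    return [
        sum(closes[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(closes))
    ]
```

In the KAHAN pipeline, quantities like these are computed by generated code and then interpreted in natural language before higher-level synthesis.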
KAHAN delivered >20% improvement in narrative quality (on GPT-4o) and achieved 98.2% factuality. These improvements suggest hierarchical, knowledge-augmented designs markedly outperform direct prompting and chain-of-thought methods for complex financial narration.
7. Prospects and Research Directions
Identified directions for future work include:
- Integrating intermediate insight recommendation steps (e.g., DataShot, Table2Analysis techniques) before full narrative synthesis.
- Combining text data with multimodal chart or visualization elements to augment factual grounding.
- Refining evaluation metrics to distinguish between numerical reliability, deep reasoning, and domain insight, moving beyond token-level matches.
- Expanding fine-tuning and context windows to enable robust analysis on extended, heterogeneous market data.
These lines of inquiry align with ongoing efforts to establish benchmarks that explicitly assess auditability, transparency, and analytical robustness in financial reporting systems.
DataTales establishes a new standard for financial narrative benchmarking, highlighting weaknesses in current LLMs and providing a structure for iterative research on narrating large-scale, high-stakes financial data (Yang et al., 23 Oct 2024, Yang et al., 21 Sep 2025). Its multi-dimensional dataset, rigorous metric suite, and hierarchical framework lay the foundations for future evaluation and development of LLMs in analytic finance.