DocMath-Eval Benchmark for LLM Reasoning
- DocMath-Eval is a comprehensive benchmark that evaluates LLMs’ abilities to retrieve dispersed evidence, execute multi-step calculations, and adhere to domain-specific conventions.
- It comprises four evaluation sets with increasing difficulty, simulating real-world tasks such as financial document analysis with extensive text and tabular data.
- Experimental results reveal that while state-of-the-art models like GPT-4 perform well, significant gaps remain compared to human experts, underscoring challenges in evidence retrieval and expert reasoning.
DocMath-Eval is a comprehensive benchmark designed to evaluate the numerical reasoning capabilities of LLMs in understanding and analyzing long, heterogeneous documents with both textual and tabular content, particularly in expert domains such as financial analysis. Unlike benchmarks based on exam-style math word problems, DocMath-Eval targets real-world document understanding, requiring the retrieval of dispersed evidence, execution of complex, multi-step calculations, and adherence to domain conventions such as unit conversions. The benchmark exposes limitations of current LLMs in realistic expert tasks and was constructed to support rigorous, reproducible evaluation under challenging conditions.
1. Motivation and Scope
DocMath-Eval addresses an observed performance gap: while LLMs excel on textbook-like math questions, they falter when required to reason over lengthy, specialized documents akin to those encountered by financial analysts or domain experts. These real-world tasks demand the navigation of documents spanning tens of thousands of words, interleaved with multi-row and multi-column tables, necessitating calculations that involve domain-specific norms (e.g., interpreting “millions,” “basis points,” or adjusting for fiscal conventions). The benchmark was established to systematically stress-test LLMs’ abilities to retrieve evidence, reason numerically, and adhere to expert standards within such settings.
2. Dataset Composition and Construction
DocMath-Eval consists of four distinct evaluation sets with increasing difficulty levels, each defined by both document complexity and numerical reasoning challenge:
| Set | Source | # Questions | Median Context Length (words) | Table Count | Key Features |
|---|---|---|---|---|---|
| DM¹ | TAT-QA & FinQA | 1,459 | 500 | 1 | Short, single table; extraction and arithmetic |
| DM² | MultiHiertt | 793 | 2,247 | 4 | Long, multiple tables; cross-table operations |
| DM³ | TAT-HQA | 1,621 | 253 | 1 | Short, counterfactual; table manipulation |
| DM⁴ | Expert-annotated SEC Filings | 2,101 | 24,736 | 48 | Very long, many tables; complex multi-stage reasoning |
Total: 5,974 high-quality question-answer pairs.
All solutions are expressed as uniform Python programs (“Program-of-Thought”), with manually annotated evidence spans. For DM⁴, expert annotators constructed questions over real SEC 10-K/10-Q filings, ensuring genuine contextual and numeric challenge. A typical annotation sequence involves (i) crafting a domain-specific question, (ii) identifying supporting text/table evidence, and (iii) authoring an explicit multi-step Python solution. The following excerpt illustrates an annotation from DM⁴:
Document excerpt: “Net revenues for the year ended December 31, 2022 were $2,450 million, compared to $2,200 million in the prior year.”
Table excerpt (simplified):
| Year | Net Revenue (million USD) | Operating Income (million USD) |
|---|---|---|
| 2021 | 2,200 | 350 |
| 2022 | 2,450 | 400 |
Question: What was the percent increase in operating income from 2021 to 2022?

Annotated Python solution:

```python
def solution():
    op_2021 = 350
    op_2022 = 400
    increase = op_2022 - op_2021
    pct_change = (increase / op_2021) * 100
    return pct_change
```
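Running `solution()` yields (400 − 350) / 350 × 100 ≈ 14.29, i.e., roughly a 14.29% increase in operating income.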
3. Task Formats and Required Reasoning
DocMath-Eval questions may require:
- Numerical extraction: Parsing values from text or cells.
- Primary arithmetic: Addition, subtraction, multiplication, division.
- Aggregations: Averages, rates, ratios, and other composite statistics.
- Multi-table cross-referencing: Synthesizing evidence across multiple tables.
- Domain-informed transformation: Unit conversions, time-series comparisons (e.g., YoY), and adjustments for scale (e.g., “in millions”).
This design ensures that performance on DocMath-Eval directly reflects an LLM’s ability to integrate retrieval, domain knowledge, and stepwise reasoning—emulating real analysis workflows rather than artificial examples.
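To make the last two categories concrete, here is a small illustrative program (all names and figures are invented for this example, not taken from the benchmark) that combines a scale adjustment for a table reported “in millions” with a year-over-year comparison:

```python
def yoy_revenue_growth():
    # Illustrative figures only: one value comes from a table labeled
    # "in millions", the other from prose stated in raw dollars.
    revenue_2022_millions = 2450          # table figure, "in millions"
    revenue_2021_dollars = 2_200_000_000  # prose figure, raw dollars

    # Domain convention: normalize both figures to the same unit (dollars)
    # before computing the year-over-year (YoY) change.
    revenue_2022_dollars = revenue_2022_millions * 1_000_000
    yoy_change_pct = (revenue_2022_dollars - revenue_2021_dollars) / revenue_2021_dollars * 100
    return yoy_change_pct

print(round(yoy_revenue_growth(), 2))  # 11.36
```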
4. Evaluation Protocols and Metrics
Two principal prompting paradigms are supported:
- Chain-of-Thought (CoT): Models must generate a natural-language explanation before providing a concise answer; a parser extracts the final answer.
- Program-of-Thought (PoT): Models output an executable Python program; this is parsed, type-checked, and executed to yield the answer.
The primary metric is accuracy:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i = y_i\right],$$

where $N$ is the question count, $\hat{y}_i$ the model’s answer, and $y_i$ the gold answer.
When applicable (e.g., answer span or evidence selection), Precision, Recall, and Macro-F₁ scores are computed as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively; Macro-F₁ averages the per-class $F_1$ scores.
For the long-context sets (DM², DM⁴), evidence retrieval recall is reported:

$$\text{Recall@}k = \frac{\left|\text{top-}k\ \text{retrieved evidence} \cap \text{gold evidence}\right|}{\left|\text{gold evidence}\right|}.$$

Only the top-$k$ retrieved evidence pieces are provided to the LLM.
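The following is a minimal sketch of how PoT responses could be scored under this protocol; the function names, the bare `exec` sandboxing, and the numeric tolerance are assumptions for illustration, not the benchmark's reference implementation.

```python
import math

def execute_pot(program: str):
    """Execute a model-generated Program-of-Thought and return the value of solution().

    Any parse error, runtime error, or missing solution() counts as a failed execution.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)        # define solution() in an isolated namespace
        return namespace["solution"]()  # run it to obtain the predicted answer
    except Exception:
        return None

def is_correct(predicted, gold, rel_tol: float = 1e-2) -> bool:
    """Compare numeric answers under a small relative tolerance (an assumed setting)."""
    if predicted is None:
        return False
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return False

def accuracy(predicted_answers, gold_answers) -> float:
    """Accuracy over the evaluation set, as defined above."""
    correct = sum(is_correct(p, g) for p, g in zip(predicted_answers, gold_answers))
    return correct / len(gold_answers)
```

In practice, a sandboxed interpreter with time and memory limits would replace the bare `exec` call.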
5. Experimental Results and Observations
Key findings from evaluation across 48 models include:
| Model | Prompt | DM¹ | DM² | DM³ | DM⁴ | Avg. Acc |
|---|---|---|---|---|---|---|
| GPT-4 | CoT | 87.9 | 56.5 | 78.0 | 38.5 | 65.2 |
| GPT-4 | PoT | 84.9 | 59.3 | 74.5 | 38.8 | 64.4 |
| GPT-3.5 | CoT | 70.1 | 44.1 | 45.0 | 21.7 | 45.2 |
| GPT-3.5 | PoT | 77.6 | 44.5 | 48.1 | 22.7 | 48.2 |
| Llama-2 70B | CoT | 53.4 | 32.9 | 33.6 | 13.3 | 33.3 |
| Human Expert | – | 91.0 | 87.0 | 84.0 | 76.0 | 84.5 |
Key observations:
- GPT-4 strongly outperforms open-source models on all sets but lags human experts, especially on DM⁴ (38.8% vs. 76.0%).
- PoT prompting offers only modest gains for GPT-series models but reduces performance for some code-heavy LLMs, due to frequent execution failures.
- Open-source models (Llama-2, Mistral) achieve only low double-digit accuracy on DM⁴, even with advanced retrieval and prompting pipelines.
- Retrieval quality critically affects performance: Ada Embedding achieves R@10 of 83% on DM² and 69.2% on DM⁴, boosting downstream accuracy accordingly.
- High PoT execution rate correlates with stronger performance; code-generation failures (low execution rates) cause marked drops in overall accuracy even for models with competitive CoT results.
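As a side note, the execution rate mentioned in the last bullet could be computed with a helper like the following, reusing the hypothetical `execute_pot` sketch from Section 4:

```python
def execution_rate(generated_programs) -> float:
    """Fraction of PoT programs that run to completion and return a value."""
    runnable = sum(execute_pot(p) is not None for p in generated_programs)
    return runnable / len(generated_programs)
```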
6. Error Analysis and Core Bottlenecks
Empirical evaluation has highlighted two fundamental limitations in contemporary LLMs as exposed by DocMath-Eval:
- Evidence Retrieval in Long Contexts: For documents of DM⁴ complexity, less than 70% of gold-standard evidence is retrieved in the top-10 candidates even with state-of-the-art dense retrievers (e.g., Ada Embedding), leading to unrecoverable reasoning errors.
- Complex Expert Reasoning: While LLMs can execute simple calculations within small tables, they struggle with multi-table intersection, complex unit transformations, and domain-specific conventions (such as adjusting for figures “in billions” or contextual year-over-year adjustments). GPT-4 errors often stem from missing relevant tables, mislabeling rows/columns, or stepwise calculation mistakes, while open-source backbones commonly select irrelevant evidence in extremely long documents.
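To make the retrieval bottleneck concrete, here is a minimal sketch of dense-retrieval scoring and Recall@k measurement; the chunking scheme and embedding model are assumed to be provided elsewhere, and this is not the benchmark's own retrieval pipeline:

```python
import numpy as np

def rank_chunks(question_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Rank document chunks by cosine similarity to the question embedding.

    `question_vec` has shape (d,); `chunk_vecs` has shape (n_chunks, d).
    Both are assumed to come from a dense retriever such as Ada Embedding.
    """
    q = question_vec / np.linalg.norm(question_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarities
    return np.argsort(-scores)     # chunk indices, best first

def recall_at_k(ranked_indices: np.ndarray, gold_indices, k: int = 10) -> float:
    """Fraction of gold evidence chunks appearing in the top-k retrieved chunks."""
    top_k = set(ranked_indices[:k].tolist())
    return len(top_k & set(gold_indices)) / len(gold_indices)
```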
7. Implications, Extensions, and Future Directions
DocMath-Eval establishes a quantifiable gap between current LLMs and human expert performance in expert-domain numerical reasoning involving long, semi-structured documents. Identified research avenues include:
- Enhanced Retrieval: Integration of hybrid approaches combining sparse and dense embedding methods, as well as structure-aware table representations, to improve Recall@k on complex document corpora.
- Domain-Aware Pre-training: Leveraging resources such as spreadsheet formulas or synthetic expert QA data to encourage LLMs’ formulaic and domain-aware reasoning abilities.
- Modular Reasoning Pipelines: Decomposing the problem into evidence selection, code-guided arithmetic, and post-hoc answer synthesis, possibly incorporating external tools or verification mechanisms to minimize calculation and grounding errors.
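A hypothetical sketch of such a modular pipeline is shown below; every component name is a placeholder for something a system designer would supply, not part of DocMath-Eval itself.

```python
from typing import Callable, List, Optional

def modular_pipeline(question: str,
                     document_chunks: List[str],
                     retrieve: Callable[[str, List[str], int], List[str]],
                     generate_program: Callable[[str, List[str]], str],
                     execute: Callable[[str], Optional[float]],
                     verify: Callable[[str, List[str], float], bool],
                     k: int = 10,
                     max_attempts: int = 3) -> Optional[float]:
    """Hypothetical decomposition into evidence selection, code-guided arithmetic,
    and post-hoc verification, with a simple retry loop when verification fails."""
    evidence = retrieve(question, document_chunks, k)    # 1) evidence selection
    answer = None
    for _ in range(max_attempts):
        program = generate_program(question, evidence)   # 2) LLM writes a PoT program
        answer = execute(program)                        # 3) sandboxed execution
        if answer is not None and verify(question, evidence, answer):
            return answer                                # 4) accepted by the verifier
    return answer                                        # best effort after all attempts
```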
Recent advances in evolvable benchmarking frameworks, such as EvolMathEval (Wang et al., 18 Aug 2025), provide a blueprint for constructing contamination-resistant, automatically evolving benchmarks (see Section 7 of Wang et al., 2025, for a detailed recipe applicable to DocMath-Eval). These approaches offer mechanisms to ensure sustained difficulty, minimize training-data leakage, and maintain relevance amid rapid LLM progress.
This suggests that, while DocMath-Eval represents a rigorous and reproducible standard for evaluating numerical reasoning over long, semi-structured documents, the field is at a nascent stage: robust performance in such settings remains an open research challenge, with benchmarks like DocMath-Eval serving as the foundation for next-generation methodological innovations.