DocMath-Eval Benchmark for LLM Reasoning
- DocMath-Eval is a comprehensive benchmark that evaluates LLMs’ abilities to retrieve dispersed evidence, execute multi-step calculations, and adhere to domain-specific conventions.
- It comprises four evaluation sets with increasing difficulty, simulating real-world tasks such as financial document analysis with extensive text and tabular data.
- Experimental results reveal that while state-of-the-art models like GPT-4 perform well, significant gaps remain compared to human experts, underscoring challenges in evidence retrieval and expert reasoning.
DocMath-Eval is a comprehensive benchmark designed to evaluate the numerical reasoning capabilities of LLMs in understanding and analyzing long, heterogeneous documents with both textual and tabular content, particularly in expert domains such as financial analysis. Unlike benchmarks based on exam-style math word problems, DocMath-Eval targets real-world document understanding, requiring the retrieval of dispersed evidence, execution of complex, multi-step calculations, and adherence to domain conventions such as unit conversions. The benchmark exposes limitations of current LLMs in realistic expert tasks and was constructed to support rigorous, reproducible evaluation under challenging conditions.
1. Motivation and Scope
DocMath-Eval addresses an observed performance gap: while LLMs excel on textbook-like math questions, they falter when required to reason over lengthy, specialized documents akin to those encountered by financial analysts or domain experts. These real-world tasks demand the navigation of documents spanning tens of thousands of words, interleaved with multi-row and multi-column tables, necessitating calculations that involve domain-specific norms (e.g., interpreting “millions,” “basis points,” or adjusting for fiscal conventions). The benchmark was established to systematically stress-test LLMs’ abilities to retrieve evidence, reason numerically, and adhere to expert standards within such settings.
2. Dataset Composition and Construction
DocMath-Eval consists of four distinct evaluation sets with increasing difficulty levels, each defined by both document complexity and numerical reasoning challenge:
| Set | Source | # Questions | Median Context Length (words) | Table Count | Key Features |
|---|---|---|---|---|---|
| DM¹ | TAT-QA & FinQA | 1,459 | 500 | 1 | Short, single table; extraction and arithmetic |
| DM² | MultiHiertt | 793 | 2,247 | 4 | Long, multiple tables; cross-table operations |
| DM³ | TAT-HQA | 1,621 | 253 | 1 | Short, counterfactual; table manipulation |
| DM⁴ | Expert-annotated SEC Filings | 2,101 | 24,736 | 48 | Very long, many tables; complex multi-stage reasoning |
Total: 5,974 high-quality question-answer pairs.
All solutions are expressed as uniform Python programs (“Program-of-Thought”), with manually annotated evidence spans. For DM⁴, expert annotators constructed questions over real SEC 10-K/10-Q filings, ensuring genuine contextual and numeric challenge. A typical annotation sequence involves (i) crafting a domain-specific question, (ii) identifying supporting text/table evidence, and (iii) authoring an explicit multi-step Python solution. The following excerpt illustrates an annotation from DM⁴:
Document excerpt: “Net revenues for the year ended December 31, 2022 were $2,450 million, compared to $2,200 million in the prior year.”
Table excerpt (simplified):
| Year | Net Revenue (million USD) | Operating Income (million USD) |
|---|---|---|
| 2021 | 2,200 | 350 |
| 2022 | 2,450 | 400 |
Question: What was the percent increase in operating income from 2021 to 2022?

Annotated Python solution:

```python
def solution():
    op_2021 = 350
    op_2022 = 400
    increase = op_2022 - op_2021
    pct_change = (increase / op_2021) * 100
    return pct_change
```
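Running `solution()` yields (400 − 350) / 350 × 100 ≈ 14.29, i.e., roughly a 14.29% increase in operating income.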
3. Task Formats and Required Reasoning
DocMath-Eval questions may require:
- Numerical extraction: Parsing values from text or cells.
- Primary arithmetic: Addition, subtraction, multiplication, division.
- Aggregations: Averages, rates, ratios, and other composite statistics.
- Multi-table cross-referencing: Synthesizing evidence across multiple tables.
- Domain-informed transformation: Unit conversions, time-series comparisons (e.g., YoY), and adjustments for scale (e.g., “in millions”).
This design ensures that performance on DocMath-Eval directly reflects an LLM’s ability to integrate retrieval, domain knowledge, and stepwise reasoning—emulating real analysis workflows rather than artificial examples.
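To make the last two categories concrete, here is a small illustrative program (all names and figures are invented for this example, not taken from the benchmark) that combines a scale adjustment for a table reported “in millions” with a year-over-year comparison:

```python
def yoy_revenue_growth():
    # Illustrative figures only: one value comes from a table labeled
    # "in millions", the other from prose stated in raw dollars.
    revenue_2022_millions = 2450          # table figure, "in millions"
    revenue_2021_dollars = 2_200_000_000  # prose figure, raw dollars

    # Domain convention: normalize both figures to the same unit (dollars)
    # before computing the year-over-year (YoY) change.
    revenue_2022_dollars = revenue_2022_millions * 1_000_000
    yoy_change_pct = (revenue_2022_dollars - revenue_2021_dollars) / revenue_2021_dollars * 100
    return yoy_change_pct

print(round(yoy_revenue_growth(), 2))  # 11.36
```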
4. Evaluation Protocols and Metrics
Two principal prompting paradigms are supported:
- Chain-of-Thought (CoT): Models must generate a natural-language explanation before providing a concise answer; a parser extracts the final answer.
- Program-of-Thought (PoT): Models output an executable Python program; this is parsed, type-checked, and executed to yield the answer.
The primary metric is accuracy:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i = y_i\right],$$

where $N$ is the question count, $\hat{y}_i$ the model’s answer, and $y_i$ the gold answer.
When applicable (e.g., answer span or evidence selection), Precision, Recall, and Macro-F₁ scores are computed as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively; Macro-F₁ averages the per-class $F_1$ scores.
For the long-context sets (DM², DM⁴), evidence retrieval recall is reported:

$$\text{Recall@}k = \frac{\left|\text{top-}k\ \text{retrieved evidence} \cap \text{gold evidence}\right|}{\left|\text{gold evidence}\right|}.$$

Only the top-$k$ retrieved evidence pieces are provided to the LLM.
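The following is a minimal sketch of how PoT responses could be scored under this protocol; the function names, the bare `exec` sandboxing, and the numeric tolerance are assumptions for illustration, not the benchmark's reference implementation.

```python
import math

def execute_pot(program: str):
    """Execute a model-generated Program-of-Thought and return the value of solution().

    Any parse error, runtime error, or missing solution() counts as a failed execution.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)        # define solution() in an isolated namespace
        return namespace["solution"]()  # run it to obtain the predicted answer
    except Exception:
        return None

def is_correct(predicted, gold, rel_tol: float = 1e-2) -> bool:
    """Compare numeric answers under a small relative tolerance (an assumed setting)."""
    if predicted is None:
        return False
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return False

def accuracy(predicted_answers, gold_answers) -> float:
    """Accuracy over the evaluation set, as defined above."""
    correct = sum(is_correct(p, g) for p, g in zip(predicted_answers, gold_answers))
    return correct / len(gold_answers)
```

In practice, a sandboxed interpreter with time and memory limits would replace the bare `exec` call.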
5. Experimental Results and Observations
Key findings from evaluation across 48 models include:
| Model | Prompt | DM¹ | DM² | DM³ | DM⁴ | Avg. Acc |
|---|---|---|---|---|---|---|
| GPT-4 | CoT | 87.9 | 56.5 | 78.0 | 38.5 | 65.2 |
| GPT-4 | PoT | 84.9 | 59.3 | 74.5 | 38.8 | 64.4 |
| GPT-3.5 | CoT | 70.1 | 44.1 | 45.0 | 21.7 | 45.2 |
| GPT-3.5 | PoT | 77.6 | 44.5 | 48.1 | 22.7 | 48.2 |
| Llama-2 70B | CoT | 53.4 | 32.9 | 33.6 | 13.3 | 33.3 |
| Human Expert | – | 91.0 | 87.0 | 84.0 | 76.0 | 84.5 |
Key observations:
- GPT-4 strongly outperforms open-source models on all sets but lags human experts, especially on DM⁴ (38.8% vs. 76.0%).
- PoT prompting offers only modest gains for GPT-series models but reduces performance for some code-heavy LLMs, due to frequent execution failures.
- Open-source models (Llama-2, Mistral) achieve only low double-digit accuracy on DM⁴, even with advanced retrieval and prompting pipelines.
- Retrieval quality critically affects performance: Ada Embedding achieves R@10 of 83% on DM² and 69.2% on DM⁴, boosting downstream accuracy accordingly.
- High PoT execution rate correlates with stronger performance; code-generation failures (low execution rates) cause marked drops in overall accuracy even for models with competitive CoT results.
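As a side note, the execution rate mentioned in the last bullet could be computed with a helper like the following, reusing the hypothetical `execute_pot` sketch from Section 4:

```python
def execution_rate(generated_programs) -> float:
    """Fraction of PoT programs that run to completion and return a value."""
    runnable = sum(execute_pot(p) is not None for p in generated_programs)
    return runnable / len(generated_programs)
```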
6. Error Analysis and Core Bottlenecks
Empirical evaluation has highlighted two fundamental limitations in contemporary LLMs as exposed by DocMath-Eval:
- Evidence Retrieval in Long Contexts: For documents of DM⁴ complexity, less than 70% of gold-standard evidence is retrieved in the top-10 candidates even with state-of-the-art dense retrievers (e.g., Ada Embedding), leading to unrecoverable reasoning errors.
- Complex Expert Reasoning: While LLMs can execute simple calculations within small tables, they struggle with multi-table intersection, complex unit transformations, and domain-specific conventions (such as adjusting for figures “in billions” or contextual year-over-year adjustments). GPT-4 errors often stem from missing relevant tables, mislabeling rows/columns, or stepwise calculation mistakes, while open-source backbones commonly select irrelevant evidence in extremely long documents.
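To make the retrieval bottleneck concrete, here is a minimal sketch of dense-retrieval scoring and Recall@k measurement; the chunking scheme and embedding model are assumed to be provided elsewhere, and this is not the benchmark's own retrieval pipeline:

```python
import numpy as np

def rank_chunks(question_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Rank document chunks by cosine similarity to the question embedding.

    `question_vec` has shape (d,); `chunk_vecs` has shape (n_chunks, d).
    Both are assumed to come from a dense retriever such as Ada Embedding.
    """
    q = question_vec / np.linalg.norm(question_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarities
    return np.argsort(-scores)     # chunk indices, best first

def recall_at_k(ranked_indices: np.ndarray, gold_indices, k: int = 10) -> float:
    """Fraction of gold evidence chunks appearing in the top-k retrieved chunks."""
    top_k = set(ranked_indices[:k].tolist())
    return len(top_k & set(gold_indices)) / len(gold_indices)
```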
7. Implications, Extensions, and Future Directions
DocMath-Eval establishes a quantifiable gap between current LLMs and human expert performance in expert-domain numerical reasoning involving long, semi-structured documents. Identified research avenues include:
- Enhanced Retrieval: Integration of hybrid approaches combining sparse and dense embedding methods, as well as structure-aware table representations, to improve Recall@k on complex document corpora.
- Domain-Aware Pre-training: Leveraging resources such as spreadsheet formulas or synthetic expert QA data to encourage LLMs’ formulaic and domain-aware reasoning abilities.
- Modular Reasoning Pipelines: Decomposing the problem into evidence selection, code-guided arithmetic, and post-hoc answer synthesis, possibly incorporating external tools or verification mechanisms to minimize calculation and grounding errors.
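A hypothetical sketch of such a modular pipeline is shown below; every component name is a placeholder for something a system designer would supply, not part of DocMath-Eval itself.

```python
from typing import Callable, List, Optional

def modular_pipeline(question: str,
                     document_chunks: List[str],
                     retrieve: Callable[[str, List[str], int], List[str]],
                     generate_program: Callable[[str, List[str]], str],
                     execute: Callable[[str], Optional[float]],
                     verify: Callable[[str, List[str], float], bool],
                     k: int = 10,
                     max_attempts: int = 3) -> Optional[float]:
    """Hypothetical decomposition into evidence selection, code-guided arithmetic,
    and post-hoc verification, with a simple retry loop when verification fails."""
    evidence = retrieve(question, document_chunks, k)    # 1) evidence selection
    answer = None
    for _ in range(max_attempts):
        program = generate_program(question, evidence)   # 2) LLM writes a PoT program
        answer = execute(program)                        # 3) sandboxed execution
        if answer is not None and verify(question, evidence, answer):
            return answer                                # 4) accepted by the verifier
    return answer                                        # best effort after all attempts
```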
Recent advances in evolvable benchmarking frameworks, such as EvolMathEval (Wang et al., 18 Aug 2025), provide a blueprint for constructing contamination-resistant, automatically evolving benchmarks (see Section 7 of Wang et al., 2025, for a detailed recipe applicable to DocMath-Eval). These approaches offer mechanisms to ensure sustained difficulty, minimize training-data leakage, and maintain relevance amid rapid LLM progress.
This suggests that, while DocMath-Eval represents a rigorous and reproducible standard for evaluating numerical reasoning over long, semi-structured documents, the field is at a nascent stage: robust performance in such settings remains an open research challenge, with benchmarks like DocMath-Eval serving as the foundation for next-generation methodological innovations.