- The paper challenges the binary exact match metric by treating temporal QA as a numerical estimation problem.
- It introduces the TempAnswerQA benchmark and implements regression-based metrics—sMAPE and MASE—to capture error magnitude and direction.
- Empirical results reveal that EM can mask significant prediction errors, urging the adoption of risk-aware, scale-sensitive evaluations.
Revisiting Exact Match: Evaluating Temporal Reasoning in LLMs as a Numerical Estimation Problem
Introduction
This work challenges the prevailing use of the exact match (EM) metric for evaluating temporal question answering (QA) in LLMs. Temporal QA typically yields numeric responses (dates, durations), yet the dominant EM metric is a binary string comparison that treats all errors identically, regardless of magnitude. EM cannot distinguish between answers that are off by 1 hour or 10 years, compromising its utility for risk-sensitive applications in time-critical domains. The authors argue for treating temporal QA as a numerical estimation problem and develop a benchmark, TempAnswerQA, facilitating regression-based evaluation using scale-invariant measures that capture both error size and direction.
Methodology
Benchmark Construction
TempAnswerQA is assembled by filtering recent temporal benchmarks—Test of Time (ToT) and TempTabQA (TTQA)—to retain only questions with numeric temporal answers (dates, durations, ages). Each entry is annotated with temporal units (year, month, day, minute, second) and a machine-parsable answer format, enabling robust conversion to numeric Python objects for meaningful metric computation.
- ToT: Synthetic arithmetic and semantic questions on temporal relationships and computations, critical for disentangling LLM memorization from domain reasoning.
- TTQA: Extracted from Wikipedia tables, covers open-domain, real-world knowledge with temporal semantics.
A stratified analysis by temporal unit and dataset is performed to ensure coverage across granularities and answer types.
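The conversion of annotated answers into numeric Python objects described above might look like the following sketch. The helper name, the supported formats, and the unit handling are illustrative assumptions, not the benchmark's actual parsing code:

```python
from datetime import datetime

def parse_temporal_answer(raw: str, unit: str):
    """Hypothetical parser: convert a raw model answer into a number
    in the annotated temporal unit, or None if unparseable."""
    text = raw.strip()
    # Digit-only answers (durations, ages, years given as plain numbers).
    try:
        return float(text)
    except ValueError:
        pass
    # Date-formatted answers: reduce to the component matching the unit.
    for fmt in ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y"):
        try:
            dt = datetime.strptime(text, fmt)
            return float({"year": dt.year, "month": dt.month, "day": dt.day}[unit])
        except (ValueError, KeyError):
            continue
    return None  # unparseable: exclude from metric computation

print(parse_temporal_answer("1998", "year"))        # 1998.0
print(parse_temporal_answer("1998-07-12", "year"))  # 1998.0
print(parse_temporal_answer("twelve", "year"))      # None
```

Returning `None` rather than raising keeps unparseable responses visible as a separate failure category instead of silently distorting the metrics.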
Metric Selection and Implementation
EM is compared against two regression-based, scale-free metrics:
- Symmetric Mean Absolute Percentage Error (sMAPE):
$$\mathrm{sMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{|\hat{y}_i - y_i|}{|\hat{y}_i| + |y_i|}$$
sMAPE achieves unit invariance and penalizes relative error, suitable for model ranking when error magnitude is important.
- Mean Absolute Scaled Error (MASE):
$$\mathrm{MASE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|\hat{y}_i - y_i|}{|y_i - \bar{Y}_u|}$$
where $\bar{Y}_u$ denotes the mean gold answer for temporal unit $u$.
MASE further normalizes model error by dataset variance, measuring performance relative to a "reasonable guess" baseline rooted in the answer distribution for each unit. For multimodal answer distributions (e.g., year-of-birth vs. age), unsupervised clustering is applied for fairer scaling.
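The two metrics can be sketched directly from the formulas above. The zero-denominator guards are implementation assumptions (pairs that would divide by zero are skipped), not part of the paper's definitions:

```python
def smape(preds, targets):
    """Symmetric MAPE in percent; skips pairs with a zero denominator."""
    terms = [abs(p - t) / (abs(p) + abs(t))
             for p, t in zip(preds, targets) if abs(p) + abs(t) > 0]
    return 100.0 * sum(terms) / len(terms)

def mase(preds, targets):
    """Per-item error scaled by deviation from the mean-answer baseline;
    an average of 1.0 means the model is no better than always guessing
    the mean. Pairs whose target equals the baseline are skipped."""
    baseline = sum(targets) / len(targets)
    terms = [abs(p - t) / abs(t - baseline)
             for p, t in zip(preds, targets) if t != baseline]
    return sum(terms) / len(terms)

targets = [10.0, 20.0, 30.0]
close = [11.0, 20.0, 33.0]
print(round(smape(close, targets), 2))  # 3.17
print(round(mase(close, targets), 2))   # 0.2
```

In a full evaluation the baseline would be computed per temporal unit (and per cluster for multimodal distributions), as described above.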
EM, sMAPE, and MASE are compared against baseline predictors (mean/median). Responses from multiple open-source LLMs (Llama-3.3-70B, Qwen2.5, Phi-4, etc.), using both zero-shot and few-shot prompts, are collected and evaluated.
Results
EM’s Insensitivity to Error Magnitude
The empirical analysis demonstrates that EM yields a misleading "all-or-nothing" assessment of temporal predictions. Two models might both exhibit 50% EM yet differ radically in average error magnitude (as measured by sMAPE or MASE). For example, in TTQA, models with EM ≈ 80% can have sMAPE ranging from 1%–20%. In particular, sMAPE exposes "near-miss" and "outlier" cases, refining the picture of LLM temporal reasoning.
- Models trained on synthetic data can exhibit misleadingly strong EM scores but suffer from large scaled errors.
- Outlier responses are easily masked under EM, but emerge with sMAPE/MASE (e.g., arithmetic mistakes leading to orders-of-magnitude error).
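This masking effect is easy to demonstrate with toy data (the numbers below are fabricated for illustration, not drawn from the benchmark): two models answer the same fraction of questions exactly, yet their errors differ by an order of magnitude under sMAPE.

```python
gold    = [1990, 2001, 12, 45]
model_a = [1990, 2001, 13, 46]   # near-misses on the wrong answers
model_b = [1990, 2001, 120, 4]   # order-of-magnitude mistakes

def em(preds, targets):
    """Exact match: percentage of answers that are string/value equal."""
    return 100.0 * sum(p == t for p, t in zip(preds, targets)) / len(targets)

def smape(preds, targets):
    """Symmetric MAPE in percent (no zero targets in this toy data)."""
    return 100.0 * sum(abs(p - t) / (abs(p) + abs(t))
                       for p, t in zip(preds, targets)) / len(targets)

print(em(model_a, gold), em(model_b, gold))   # 50.0 50.0 -- identical EM
print(round(smape(model_a, gold), 1))         # 1.3
print(round(smape(model_b, gold), 1))         # 41.4
```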
Model Ranking Is Metric-Sensitive
While sMAPE and EM are strongly (negatively) correlated (Spearman ρ < −0.9, as expected, since lower relative error accompanies higher exact-match accuracy), they can still diverge in model ranking due to differences in error distribution. MASE, scaling by the dataset deviation, further reshuffles the ranking and penalizes large, domain-implausible mistakes (e.g., a predicted athlete age of 400 years), which are not reflected in EM or sMAPE values.
- Larger models (Llama-3.3, Qwen2.5) generally outperform small models across all metrics, but synthetic pretraining (Phi/Qwen) often yields higher scaled errors per MASE compared to non-synthetic (Llama) models, suggesting different retention of temporal priors.
Error Distribution and Transitional Times
The analysis of error histograms reveals a preponderance of ±1 mistakes, most frequent in duration-related questions (i.e., "transitional times") and often explained by ambiguity in the temporal calculation (e.g., whether the birthday has already occurred in the reference year). This effect is stronger for duration units/age and points to both model uncertainty and training-set signal bias (transitional dates are frequent in naturally occurring text).
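The birthday ambiguity behind many of these ±1 errors is a small arithmetic subtlety, illustrated here (the helper is an illustration, not code from the paper):

```python
from datetime import date

def age_on(birth: date, ref: date) -> int:
    """Completed years of age: subtract one if the birthday has
    not yet occurred in the reference year."""
    return ref.year - birth.year - ((ref.month, ref.day) < (birth.month, birth.day))

birth = date(1980, 9, 15)
print(age_on(birth, date(2024, 6, 1)))  # 43 (birthday not yet reached)
print(2024 - 1980)                      # 44 (naive year subtraction)
```

A model that subtracts years without conditioning on the reference date produces exactly the ±1 duration errors the histograms show.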
Furthermore, error directionality is asymmetric under MASE: positive errors (over-estimates) carry a higher expected cost than negative errors, which may interact with domain-specific risk.
Implementation Considerations
Practical Workflow
- Dataset Labeling: Raw QA data must include metadata for the expected temporal unit and a parseable answer format. Parsing and type conversion are carried out with Python's `datetime`, `timedelta`, or numeric primitives, with error handling for unparseable outputs.
- Metric Computation: When deploying TempAnswerQA-style evaluations, implement logic to:
- Detect and convert digit-only or YYYY-formatted answers.
- Cluster answer distributions for MASE scaling, using libraries such as `scikit-learn` (e.g., HDBSCAN for mode separation).
- Apply sMAPE only where denominator ≠ 0 and both the target and prediction are numeric and of matching type.
- Prefer MASE over sMAPE for date-valued answers, where percentage error is not meaningful (a one-year miss on a calendar year is a tiny relative error), and for other ambiguous scalings.
- Model Selection: When error cost is non-linear or high precision is required (e.g., medication schedules or age-critical contexts), prefer model rankings by sMAPE/MASE over EM. For general-purpose settings, EM may suffice, but awareness of its masking characteristics is crucial.
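The guard conditions from the checklist above can be collected into a single predicate. The function name and type checks are assumptions for illustration:

```python
def valid_for_smape(pred, target) -> bool:
    """Apply sMAPE only when both values are numeric and the
    denominator |pred| + |target| is non-zero."""
    if not all(isinstance(v, (int, float)) for v in (pred, target)):
        return False
    return abs(pred) + abs(target) > 0

print(valid_for_smape(1998.0, 1998.0))  # True
print(valid_for_smape(0, 0))            # False (zero denominator)
print(valid_for_smape("1998", 1998))    # False (unparsed string)
```

Pairs that fail the check are routed to MASE or excluded, depending on why they failed.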
Deployment Guidance
- Benchmarks: Avoid reporting EM in isolation for temporal tasks; always supplement with at least sMAPE and, where possible, MASE.
- Prompt Engineering: To ensure model outputs are consistently parseable, employ system prompts specifying answer formatting and suffixing with clear instruction cues; use few-shot exemplars to increase format compliance, especially when the evaluation script is strict about numeric formatting.
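A prompt following this guidance might look like the sketch below. This is an illustrative template, not a prompt from the paper; the placeholder `{question}` marks where the evaluated item is substituted:

```
System: Answer with a single number only. For dates, use the format
YYYY-MM-DD. Do not include units, words, or explanations.

Q: How many years passed between the two events?   A: 12
Q: On what date did the event occur?               A: 1969-07-20
Q: {question}                                      A:
```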
Computational Overhead
- Scalability: sMAPE/MASE metric evaluation imposes minimal computational overhead, dominated by answer parsing. Clustering for MASE scaling is a one-time setup per unit/type.
- Dataset Growth: As synthetic benchmarks increase in size and diversity, cluster-based MASE and unit-wise error normalizations must be re-validated to avoid penalizing multi-modal distributions unfairly.
Implications and Future Work
This study provides a formal foundation for risk- and cost-aware evaluation of temporal QA in LLMs. In domains such as healthcare (e.g., medication schedule computation), logistics, or finance, understanding whether an LLM’s answer is "almost correct" or "wildly off" is essential. Regression-based metrics allow for:
- Improved model selection: Penalize high-risk errors even for weakly worded questions or ambiguous contexts.
- Fine-grained diagnostics: Identify equivalence classes of error (e.g., transition ambiguity, arithmetic failures) that EM cannot expose.
- Benchmark innovation: Future datasets and leaderboards should natively incorporate sMAPE/MASE, annotated time units, and per-task error cost mappings.
Limiting factors include the need for parseable model responses, coverage of ambiguous temporal expressions (periods, recurring events), and the challenge of defining domain-appropriate scaling and clustering for MASE. Future developments should consider human-in-the-loop calibration for error acceptability, adaptive error scaling (e.g., event-specific bounds), LLM-as-a-judge approaches, and automated prompt validation pipelines.
The findings also argue for integrating tool-use or symbolic calculators as a default architectural primitive for LLMs engaged in temporal reasoning, as arithmetic mistakes rather than conceptual misunderstandings often drive scaled error magnitude.
Conclusion
Temporal QA in LLMs is fundamentally a numerical estimation challenge. EM is inadequate for evaluating temporal grounding, especially under variable risk regimes. sMAPE and MASE expose model weaknesses in domain knowledge, arithmetic, and sensitivity to natural ambiguity in event timing, and they provide practitioners with actionable differentiators for model selection and deployment risk management. As LLMs are further adopted into domains requiring temporally robust QA, support for regression-based evaluation will be essential for both correctness and safety.