- The paper challenges the binary exact match metric by treating temporal QA as a numerical estimation problem.
- It introduces the TempAnswerQA benchmark and implements regression-based metrics—sMAPE and MASE—to capture error magnitude and direction.
- Empirical results reveal that EM can mask significant prediction errors, urging the adoption of risk-aware, scale-sensitive evaluations.
Revisiting Exact Match: Evaluating Temporal Reasoning in LLMs as a Numerical Estimation Problem
Introduction
This work challenges the prevailing use of the exact match (EM) metric for evaluating temporal question answering (QA) in LLMs. Temporal QA typically yields numeric responses (dates, durations), yet the dominant EM metric is a binary string comparison that treats all errors identically, regardless of magnitude. EM cannot distinguish between answers that are off by 1 hour or 10 years, compromising its utility for risk-sensitive applications in time-critical domains. The authors argue for treating temporal QA as a numerical estimation problem and develop a benchmark, TempAnswerQA, facilitating regression-based evaluation using scale-invariant measures that capture both error size and direction.
Methodology
Benchmark Construction
TempAnswerQA is assembled by filtering recent temporal benchmarks—Test of Time (ToT) and TempTabQA (TTQA)—to retain only questions with numeric temporal answers (dates, durations, ages). Each entry is annotated with temporal units (year, month, day, minute, second) and a machine-parsable answer format, enabling robust conversion to numeric Python objects for meaningful metric computation.
- ToT: Synthetic arithmetic and semantic questions on temporal relationships and computations, critical for disentangling LLM memorization from domain reasoning.
- TTQA: Extracted from Wikipedia tables, covers open-domain, real-world knowledge with temporal semantics.
A stratified analysis by temporal unit and dataset is performed to ensure coverage across granularities and answer types.
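The conversion of annotated answers into numeric Python objects described above might look like the following sketch. The helper name, the supported formats, and the unit handling are illustrative assumptions, not the benchmark's actual parsing code:

```python
from datetime import datetime

def parse_temporal_answer(raw: str, unit: str):
    """Hypothetical parser: convert a raw model answer into a number
    in the annotated temporal unit, or None if unparseable."""
    text = raw.strip()
    # Digit-only answers (durations, ages, years given as plain numbers).
    try:
        return float(text)
    except ValueError:
        pass
    # Date-formatted answers: reduce to the component matching the unit.
    for fmt in ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y"):
        try:
            dt = datetime.strptime(text, fmt)
            return float({"year": dt.year, "month": dt.month, "day": dt.day}[unit])
        except (ValueError, KeyError):
            continue
    return None  # unparseable: exclude from metric computation

print(parse_temporal_answer("1998", "year"))        # 1998.0
print(parse_temporal_answer("1998-07-12", "year"))  # 1998.0
print(parse_temporal_answer("twelve", "year"))      # None
```

Returning `None` rather than raising keeps unparseable responses visible as a separate failure category instead of silently distorting the metrics.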
Metric Selection and Implementation
EM is compared against two regression-based, scale-free metrics:
- Symmetric Mean Absolute Percentage Error (sMAPE):
$$\mathrm{sMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{|\hat{y}_i - y_i|}{|\hat{y}_i| + |y_i|}$$
sMAPE achieves unit invariance and penalizes relative error, suitable for model ranking when error magnitude is important.
- Mean Absolute Scaled Error (MASE):
$$\mathrm{MASE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|\hat{y}_i - y_i|}{|y_i - \bar{Y}_u|}$$
where $\bar{Y}_u$ denotes the mean gold answer for temporal unit $u$.
MASE further normalizes model error by dataset variance, measuring performance relative to a "reasonable guess" baseline rooted in the answer distribution for each unit. For multimodal answer distributions (e.g., year-of-birth vs. age), unsupervised clustering is applied for fairer scaling.
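The two metrics can be sketched directly from the formulas above. The zero-denominator guards are implementation assumptions (pairs that would divide by zero are skipped), not part of the paper's definitions:

```python
def smape(preds, targets):
    """Symmetric MAPE in percent; skips pairs with a zero denominator."""
    terms = [abs(p - t) / (abs(p) + abs(t))
             for p, t in zip(preds, targets) if abs(p) + abs(t) > 0]
    return 100.0 * sum(terms) / len(terms)

def mase(preds, targets):
    """Per-item error scaled by deviation from the mean-answer baseline;
    an average of 1.0 means the model is no better than always guessing
    the mean. Pairs whose target equals the baseline are skipped."""
    baseline = sum(targets) / len(targets)
    terms = [abs(p - t) / abs(t - baseline)
             for p, t in zip(preds, targets) if t != baseline]
    return sum(terms) / len(terms)

targets = [10.0, 20.0, 30.0]
close = [11.0, 20.0, 33.0]
print(round(smape(close, targets), 2))  # 3.17
print(round(mase(close, targets), 2))   # 0.2
```

In a full evaluation the baseline would be computed per temporal unit (and per cluster for multimodal distributions), as described above.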
EM, sMAPE, and MASE are compared against baseline predictors (mean/median). Responses from multiple open-source LLMs (Llama-3.3-70B, Qwen2.5, Phi-4, etc.), using both zero-shot and few-shot prompts, are collected and evaluated.
Results
EM’s Insensitivity to Error Magnitude
The empirical analysis demonstrates that EM yields a misleading "all-or-nothing" assessment of temporal predictions. Two models might both exhibit 50% EM yet differ radically in average error magnitude (as measured by sMAPE or MASE). For example, in TTQA, models with EM ≈ 80% can have sMAPE ranging from 1%–20%. In particular, sMAPE exposes "near-miss" and "outlier" cases, refining the picture of LLM temporal reasoning.
- Models trained on synthetic data can exhibit misleadingly strong EM scores but suffer from large scaled errors.
- Outlier responses are easily masked under EM, but emerge with sMAPE/MASE (e.g., arithmetic mistakes leading to orders-of-magnitude error).
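This masking effect is easy to demonstrate with toy data (the numbers below are fabricated for illustration, not drawn from the benchmark): two models answer the same fraction of questions exactly, yet their errors differ by an order of magnitude under sMAPE.

```python
gold    = [1990, 2001, 12, 45]
model_a = [1990, 2001, 13, 46]   # near-misses on the wrong answers
model_b = [1990, 2001, 120, 4]   # order-of-magnitude mistakes

def em(preds, targets):
    """Exact match: percentage of answers that are string/value equal."""
    return 100.0 * sum(p == t for p, t in zip(preds, targets)) / len(targets)

def smape(preds, targets):
    """Symmetric MAPE in percent (no zero targets in this toy data)."""
    return 100.0 * sum(abs(p - t) / (abs(p) + abs(t))
                       for p, t in zip(preds, targets)) / len(targets)

print(em(model_a, gold), em(model_b, gold))   # 50.0 50.0 -- identical EM
print(round(smape(model_a, gold), 1))         # 1.3
print(round(smape(model_b, gold), 1))         # 41.4
```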
Model Ranking Is Metric-Sensitive
While sMAPE and EM are strongly (negatively) correlated (Spearman ρ < −0.9, as expected, since lower relative error accompanies higher exact-match accuracy), they can still diverge in model ranking due to differences in error distribution. MASE, scaling by the dataset deviation, further reshuffles the ranking and penalizes large, domain-implausible mistakes (e.g., a predicted athlete age of 400 years), which are not reflected in EM or sMAPE values.
- Larger models (Llama-3.3, Qwen2.5) generally outperform small models across all metrics, but synthetic pretraining (Phi/Qwen) often yields higher scaled errors per MASE compared to non-synthetic (Llama) models, suggesting different retention of temporal priors.
Error Distribution and Transitional Times
The analysis of error histograms reveals a preponderance of ±1 mistakes, most frequent in duration-related questions (i.e., "transitional times") and often explained by ambiguity in the temporal calculation (e.g., whether the birthday has already occurred in the reference year). This effect is stronger for duration units/age and points to both model uncertainty and training-set signal bias (transitional dates are frequent in naturally occurring text).
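The birthday ambiguity behind many of these ±1 errors is a small arithmetic subtlety, illustrated here (the helper is an illustration, not code from the paper):

```python
from datetime import date

def age_on(birth: date, ref: date) -> int:
    """Completed years of age: subtract one if the birthday has
    not yet occurred in the reference year."""
    return ref.year - birth.year - ((ref.month, ref.day) < (birth.month, birth.day))

birth = date(1980, 9, 15)
print(age_on(birth, date(2024, 6, 1)))  # 43 (birthday not yet reached)
print(2024 - 1980)                      # 44 (naive year subtraction)
```

A model that subtracts years without conditioning on the reference date produces exactly the ±1 duration errors the histograms show.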
Furthermore, error directionality is asymmetric under MASE: positive errors (over-estimates) carry a higher expected cost than negative errors, which may interact with domain-specific risk.
Implementation Considerations
Practical Workflow
- Dataset Labeling: Raw QA data must include metadata for the expected temporal unit and a parseable answer format. Parsing and type conversion are carried out with Python's `datetime`, `timedelta`, or numeric primitives, with error handling for unparseable outputs.
- Metric Computation: When deploying TempAnswerQA-style evaluations, implement logic to:
- Detect and convert digit-only or YYYY-formatted answers.
- Cluster answer distributions for MASE scaling, using libraries such as `scikit-learn` (e.g., HDBSCAN for mode separation).
- Apply sMAPE only where denominator ≠ 0 and both the target and prediction are numeric and of matching type.
- Prefer MASE over sMAPE for date-valued answers, where percentage error is not meaningful (a one-year miss on a calendar year is a tiny relative error), and for other ambiguous scalings.
- Model Selection: When error cost is non-linear or high precision is required (e.g., medication schedules or age-critical contexts), prefer model rankings by sMAPE/MASE over EM. For general-purpose settings, EM may suffice, but awareness of its masking characteristics is crucial.
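The guard conditions from the checklist above can be collected into a single predicate. The function name and type checks are assumptions for illustration:

```python
def valid_for_smape(pred, target) -> bool:
    """Apply sMAPE only when both values are numeric and the
    denominator |pred| + |target| is non-zero."""
    if not all(isinstance(v, (int, float)) for v in (pred, target)):
        return False
    return abs(pred) + abs(target) > 0

print(valid_for_smape(1998.0, 1998.0))  # True
print(valid_for_smape(0, 0))            # False (zero denominator)
print(valid_for_smape("1998", 1998))    # False (unparsed string)
```

Pairs that fail the check are routed to MASE or excluded, depending on why they failed.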
Deployment Guidance
- Benchmarks: Avoid reporting EM in isolation for temporal tasks; always supplement with at least sMAPE and, where possible, MASE.
- Prompt Engineering: To ensure model outputs are consistently parseable, employ system prompts specifying answer formatting and suffixing with clear instruction cues; use few-shot exemplars to increase format compliance, especially when the evaluation script is strict about numeric formatting.
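A prompt following this guidance might look like the sketch below. This is an illustrative template, not a prompt from the paper; the placeholder `{question}` marks where the evaluated item is substituted:

```
System: Answer with a single number only. For dates, use the format
YYYY-MM-DD. Do not include units, words, or explanations.

Q: How many years passed between the two events?   A: 12
Q: On what date did the event occur?               A: 1969-07-20
Q: {question}                                      A:
```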
Computational Overhead
- Scalability: sMAPE/MASE metric evaluation imposes minimal computational overhead, dominated by answer parsing. Clustering for MASE scaling is a one-time setup per unit/type.
- Dataset Growth: As synthetic benchmarks increase in size and diversity, cluster-based MASE and unit-wise error normalizations must be re-validated to avoid penalizing multi-modal distributions unfairly.
Implications and Future Work
This study provides a formal foundation for risk- and cost-aware evaluation of temporal QA in LLMs. In domains such as healthcare (e.g., medication schedule computation), logistics, or finance, understanding whether an LLM’s answer is "almost correct" or "wildly off" is essential. Regression-based metrics allow for:
- Improved model selection: Penalize high-risk errors even for weakly worded questions or ambiguous contexts.
- Fine-grained diagnostics: Identify equivalence classes of error (e.g., transition ambiguity, arithmetic failures) that EM cannot expose.
- Benchmark innovation: Future datasets and leaderboards should natively incorporate sMAPE/MASE, annotated time units, and per-task error cost mappings.
Limiting factors include the need for parseable model responses, coverage of ambiguous temporal expressions (periods, recurring events), and the challenge of defining domain-appropriate scaling and clustering for MASE. Future developments should consider human-in-the-loop calibration for error acceptability, adaptive error scaling (e.g., event-specific bounds), LLM-as-a-judge approaches, and automated prompt validation pipelines.
The findings also argue for integrating tool-use or symbolic calculators as a default architectural primitive for LLMs engaged in temporal reasoning, as arithmetic mistakes rather than conceptual misunderstandings often drive scaled error magnitude.
Conclusion
Temporal QA in LLMs is fundamentally a numerical estimation challenge. EM is inadequate for evaluating temporal grounding, especially under variable risk regimes. sMAPE and MASE expose model weaknesses in domain knowledge, arithmetic, and sensitivity to natural ambiguity in event timing, and they provide practitioners with actionable differentiators for model selection and deployment risk management. As LLMs are further adopted into domains requiring temporally robust QA, support for regression-based evaluation will be essential for both correctness and safety.