
TempoBench: Temporal Reasoning Benchmark

Updated 3 November 2025
  • TempoBench is a formally grounded diagnostic benchmark that evaluates LLMs on multi-step temporal and causal reasoning using automata synthesized from linear temporal logic.
  • It systematically parametrizes task difficulty through features like effect depth and state count, ensuring interpretable and reproducible evaluation of trace execution and causal credit assignment.
  • Experimental findings reveal significant LLM limitations, with performance dropping drastically on complex causal tasks, underscoring the need for tailored architectural advancements.

TempoBench is a formally grounded diagnostic benchmark for LLMs, designed to deconstruct and evaluate their performance on multi-step temporal and causal reasoning tasks over formally specified reactive systems. Unlike prior benchmarks relying on ad hoc datasets or mathematical proof assistants, TempoBench provides systematic parametrization of task difficulty and verifiable ground truth for task execution and causal credit assignment. The benchmark employs finite-state automata synthesized from linear temporal logic specifications to generate test instances, enabling interpretable, scalable, and reproducible evaluation of LLM reasoning ability.

1. Formal Foundations and Motivation

TempoBench was conceived to address limitations in existing reasoning benchmarks, which either lack formal verifiability or fail to capture the agentic, decision-chain structures typical of realistic business processes or code agents. Traditional approaches either use synthetic datasets that are prone to bias and whose correctness cannot be verified, or formal systems like Lean that do not reflect multi-step temporal workflows. By leveraging automata-theoretic synthesis from linear temporal logic (LTL), TempoBench explicitly models reactive systems as finite-state automata (FSAs), specified as:

$$A = (Q, E, \delta, q_0, F)$$

where $Q$ is the set of states, $E$ is the input/output alphabet, $\delta$ is the transition relation, $q_0$ is the initial state, and $F$ is the set of accepting states. This structure guarantees precise, deterministic ground truth for both system trace execution and causal relationships, supporting rigorous grading and difficulty control.
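To make the formalism concrete, the following minimal Python sketch encodes an FSA and checks whether a trace ends in an accepting state. It is an illustrative reconstruction, not TempoBench's actual implementation; the same acceptance check is exactly the ground truth needed for the trace-evaluation task described in Section 2.

```python
from dataclasses import dataclass

@dataclass
class FSA:
    """Finite-state automaton A = (Q, E, delta, q0, F)."""
    states: set       # Q
    alphabet: set     # E (input/output symbols)
    delta: dict       # transition relation: (state, symbol) -> state
    q0: str           # initial state
    accepting: set    # F

def accepts(fsa: FSA, trace: list) -> bool:
    """Run the trace through the automaton; accept iff we end in F."""
    q = fsa.q0
    for symbol in trace:
        if (q, symbol) not in fsa.delta:
            return False  # no transition defined: reject
        q = fsa.delta[(q, symbol)]
    return q in fsa.accepting

# Example: a two-state automaton accepting traces that end in "b"
fsa = FSA(
    states={"q0", "q1"},
    alphabet={"a", "b"},
    delta={("q0", "a"): "q0", ("q0", "b"): "q1",
           ("q1", "a"): "q0", ("q1", "b"): "q1"},
    q0="q0",
    accepting={"q1"},
)
assert accepts(fsa, ["a", "b"])
assert not accepts(fsa, ["b", "a"])
```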

2. Benchmark Structure and Task Definitions

TempoBench comprises two principal tasks:

  • Temporal Trace Evaluation (TTE): Given an automaton $A$ and a trace $I$, the model must determine whether $I$ is accepted by $A$, i.e., whether the sequence of transitions leads to an accepting state. This task tests the LLM's ability to simulate or verify stepwise system execution and compliance with temporal logic specifications.
  • Temporal Causal Evaluation (TCE): Given an automaton $A$, a trace $z$, and an output (effect) $e$ at timestep $T_i$, the model is required to identify the minimal set of input actions across all previous timesteps that are necessary for $e$ to occur at $T_i$. Formally, causality is defined such that a set of inputs $C$ is a cause for $E$ at time $t$ if (a brute-force check of this definition is sketched after this list):
  1. $T \models C$ and $T \models E$ (the trace exhibits both the candidate cause and the effect),
  2. counterfactual removal of $C$ eliminates $E$, and
  3. $C$ is minimal (no strict subset $C' \subset C$ also satisfies conditions 1 and 2).
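Conditions 1-3 can be checked mechanically. The sketch below is a minimal brute-force version, assuming a hypothetical `effect_occurs(trace, t)` oracle that replays a (possibly modified) trace through the automaton and reports whether the effect holds at timestep `t`; neither the function names nor the trace encoding are taken from the benchmark's code.

```python
from itertools import combinations

def remove_inputs(trace, cause):
    """Counterfactually drop the given (timestep, input) pairs from the trace."""
    return [
        [x for x in step if (t, x) not in cause]
        for t, step in enumerate(trace)
    ]

def is_cause(trace, cause, t_effect, effect_occurs) -> bool:
    """Check conditions 1-3: factuality, counterfactual necessity, minimality.

    `trace` is a list of per-timestep input sets; `cause` is a set of
    (timestep, input) pairs; `effect_occurs` is an assumed oracle.
    """
    # 1. Both the candidate cause and the effect actually occur in the trace.
    if not all(inp in trace[t] for t, inp in cause):
        return False
    if not effect_occurs(trace, t_effect):
        return False
    # 2. Removing the cause eliminates the effect.
    if effect_occurs(remove_inputs(trace, cause), t_effect):
        return False
    # 3. Minimality: no strict subset already satisfies conditions 1-2.
    for k in range(len(cause)):
        for subset in combinations(cause, k):
            sub = set(subset)
            if sub and not effect_occurs(remove_inputs(trace, sub), t_effect):
                return False  # a smaller counterfactual set suffices
    return True
```

An enumeration like this is exponential in the cause-set size; it is shown only to make the three conditions operational, not as the benchmark's generation procedure.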

Difficulty is systematically controlled by parametrizing features such as effect depth (distance from cause to effect), system state-space size, transition count, causal input cardinality, and trace diversity. This supports interpretable scaling from "normal" to "hard" problem regimes.
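Viewed as code, this parametrization amounts to a generator configuration. The field names and numeric values below are illustrative placeholders, not TempoBench's actual API or settings:

```python
from dataclasses import dataclass

@dataclass
class DifficultyConfig:
    """Hypothetical knobs for instance generation (names are illustrative)."""
    effect_depth: int       # timesteps between cause and effect
    num_states: int         # |Q| of the synthesized automaton
    num_transitions: int    # size of the transition relation
    cause_cardinality: int  # inputs in the minimal cause set
    trace_diversity: float  # variation across sampled traces

# Placeholder values, not the benchmark's actual regimes
NORMAL = DifficultyConfig(effect_depth=2, num_states=8,
                          num_transitions=24, cause_cardinality=2,
                          trace_diversity=0.5)
HARD = DifficultyConfig(effect_depth=6, num_states=32,
                        num_transitions=128, cause_cardinality=4,
                        trace_diversity=0.9)
```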

3. Evaluation Metrics and Scoring Protocols

Performance on TempoBench is quantified using precise, formally defined metrics:

  • Precision: $\frac{TP}{TP + FP}$
  • Recall: $\frac{TP}{TP + FN}$
  • F1 Score: $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Metrics are computed both at the atomic proposition (AP) level—assessing correctness for each output symbol at every timestep—and at the time step (TS) level, which requires all outputs at a timestep to be correct for credit. This dual-level granularity supports analytic breakdowns of partial versus complete reasoning successes.
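Both granularities are straightforward to compute once predictions and ground truth are represented as per-timestep sets of atomic propositions (an assumed layout). For brevity the TS-level score is shown below as strict exact-match accuracy; the benchmark also reports F1 at that granularity.

```python
def f1(tp, fp, fn):
    """Standard F1 from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def ap_level_f1(pred, gold):
    """AP level: score each output symbol at each timestep independently."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    fp = sum(len(p - g) for p, g in zip(pred, gold))
    fn = sum(len(g - p) for p, g in zip(pred, gold))
    return f1(tp, fp, fn)

def ts_level_accuracy(pred, gold):
    """TS level: a timestep earns credit only if ALL its outputs match."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# One set of atomic propositions per timestep (hypothetical symbols)
pred = [{"grant"}, {"grant", "ack"}, set()]
gold = [{"grant"}, {"ack"}, set()]
print(ap_level_f1(pred, gold))        # partial credit at the AP level
print(ts_level_accuracy(pred, gold))  # strict credit at the TS level
```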

4. Experimental Findings on LLM Reasoning

State-of-the-art LLMs, including GPT-4o and Claude variants, were evaluated on both tasks. Representative TCE results:

  • TCE "normal" regime: F1 (TS) = 65.6%, F1 (AP) = 59.5%
  • TCE "hard" regime: F1 (TS) = 7.5%, F1 (AP) = 8.5%

Performance on TTE for both normal and hard sets is generally higher (above 50–60% F1), indicating that trace simulation is less challenging for current LLM architectures than multi-step causal identification. Notably, as automata complexity increases—through higher state counts, transition density, or deeper effect attribution—LLM performance degrades sharply, with TCE-hard essentially approaching chance levels. This demonstrates a severe limitation in current LLM temporal causal credit assignment.

Correlation and regression analyses (including random forest and SHAP-value diagnostics) indicate that system complexity, effect depth, and sparsity of causal inputs are principal drivers of difficulty. The variance in LLM F1 scores is explained by these features with coefficients of determination ($R^2$) up to ~0.65.
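An analysis of this kind can be reproduced with standard tooling. The sketch below fits a random forest to per-instance structural features and prints importances on synthetic placeholder data; the feature names and values are assumptions, not taken from the paper's released code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X: one row per benchmark instance; columns are structural features
# (e.g., state count, transition count, effect depth). y: per-instance F1.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))                          # placeholder features
y = 1.0 - 0.8 * X[:, 2] + 0.05 * rng.normal(size=200)   # toy target

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(model.score(X, y))  # coefficient of determination R^2 on the fit
for name, imp in zip(["state_count", "transition_count", "effect_depth"],
                     model.feature_importances_):
    print(f"{name}: {imp:.3f}")

# SHAP values (optional dependency) attribute per-instance contributions:
# import shap
# shap_values = shap.TreeExplainer(model).shap_values(X)
```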

5. Diagnostic and Interpretability Features

A distinguishing attribute of TempoBench is its capacity for interpretable deconstruction of LLM reasoning. By systematically varying structural properties across tasks, the benchmark allows researchers to pinpoint failure modes, such as the inability to track long causal chains, confusion in large latent state spaces, or loss of credit assignment when causal input sets are sparse.

Feature importance analysis reveals that:

| Feature          | Effect on Difficulty (TCE)           |
|------------------|--------------------------------------|
| State count      | Higher → lower F1                    |
| Transition count | Higher → lower F1                    |
| Effect depth     | Higher → lower F1                    |
| Unique inputs    | More → easier for TTE, mixed for TCE |

This level of analytic breakdown is unprecedented for LLM reasoning diagnostics, enabling targeted interventions in architecture, training protocol, or dataset curation.

6. Impact and Research Directions

TempoBench fills a currently unmet need for formally parametric, rigorously diagnostic benchmarks in LLM temporal reasoning. Unlike prior synthetic or mathematical datasets, its automata-theoretic framework supports:

  • Automated, reproducible ground truth generation at scale.
  • Difficulty controls that can be swept on a grid for fine-grained benchmarking.
  • Objective measurement of both trace simulation and deep causal credit assignment.

Experimental evidence confirms that multi-step temporal causality remains a critical challenge for leading LLMs, with hard tasks nearly unsolvable at present performance levels despite strong results on shallow or symbolic inference. The diagnostic nature of TempoBench suggests that future advances may require new architectures or training regimes specifically attuned to extended temporal and causal dependencies.

TempoBench code and dataset generation pipeline are publicly available (see: https://github.com/nik-hz/tempobench), supporting direct replication and extension for future research. Its adoption is poised to set new standards for evaluative rigor and analytical interpretability in temporal reasoning systems.
