Evaluating Temporal Reasoning with the TRAVELER Benchmark
The paper introduces TRAVELER, a benchmark for evaluating the temporal reasoning capabilities of large language models (LLMs), specifically their handling of explicit, implicit, and vague temporal references. The work highlights the complexities inherent in temporal reasoning, emphasizing that the degree of temporal explicitness and the size of the underlying event set are critical factors influencing model performance in natural language understanding applications.
Overview of TRAVELER Benchmark
TRAVELER is a synthetic benchmark framed as a Question Answering task, comprising 3,300 questions about everyday household events. It distinguishes three categories of temporal references: explicit, implicit relative to speech time, and vague, each posing a distinct challenge for temporal reasoning. The benchmark supports systematic evaluation by assessing not only how well models resolve each reference type but also how performance degrades as the event set grows from five to one hundred events.
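To make the setup concrete, the following is a minimal sketch of how a TRAVELER-style QA instance could be represented in Python; the class names, fields, and example events are illustrative assumptions rather than the benchmark's actual schema.

```python
from dataclasses import dataclass

# Illustrative structure for a TRAVELER-style QA instance.
# All names and example content are assumptions for clarity,
# not the benchmark's actual data format.

@dataclass
class Event:
    description: str   # e.g. "Alice watered the plants"
    date: str          # when the event happened, e.g. "2023-05-04"

@dataclass
class TemporalQuestion:
    events: list[Event]     # the event set (five to one hundred events)
    question: str           # question containing a temporal reference
    reference_type: str     # "explicit" | "implicit" | "vague"
    answer: str             # ground-truth answer

example = TemporalQuestion(
    events=[
        Event("Alice watered the plants", "2023-05-04"),
        Event("Alice did the laundry", "2023-05-06"),
    ],
    question="What did Alice do on May 4th, 2023?",   # explicit reference
    reference_type="explicit",
    answer="watered the plants",
)
```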
Experimental Design and Evaluation
The authors propose hypotheses that performance declines as temporal references become less explicit, that vague references are handled markedly worse than explicit ones, and that longer event sets degrade reasoning performance. Using four state-of-the-art LLMs (Gemma-7b-it, Llama3-8B-Instruct, Llama3-70B-Instruct, and GPT-4), the paper performs an empirical analysis across various prompts and question categories. Ground-truth answers for vague temporal references are derived from a human survey and treated probabilistically, so that even ambiguously phrased queries can be judged objectively.
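As a rough illustration of how survey responses could be turned into a probabilistic ground truth for a vague question, consider the sketch below; the aggregation rule and the cutoff are assumptions, not the paper's exact procedure.

```python
from collections import Counter

def survey_ground_truth(survey_answers: list[str], min_share: float = 0.2) -> dict[str, float]:
    """Turn human survey answers to a vague temporal question into a
    probabilistic ground truth: each answer's share of the votes.

    The `min_share` cutoff is an illustrative assumption; the paper's
    actual aggregation procedure may differ.
    """
    counts = Counter(survey_answers)
    total = sum(counts.values())
    return {ans: n / total for ans, n in counts.items() if n / total >= min_share}

# Example: annotators answering "What did Bob do recently?"
responses = ["cooked dinner", "cooked dinner", "cooked dinner", "went jogging"]
print(survey_ground_truth(responses))  # {'cooked dinner': 0.75, 'went jogging': 0.25}
```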
Numerical Findings
Across all tested LLMs, the empirical results show substantial performance variation with temporal explicitness and event set length. On explicit temporal references, models achieve the highest accuracy, ranging from 75% to 92%. Implicit temporal references relative to speech time cause a performance decline of approximately 26%, with accuracy spanning 34% to 74%. Vague references pose the greatest challenge, with accuracy between 26% and 45%. Across these conditions, Llama3-70B-Instruct, the largest open-weight model evaluated, generally outperformed the others, suggesting a correlation between model scale and temporal reasoning capability.
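The reported figures are accuracies aggregated over the benchmark's two evaluation dimensions, reference category and event set length. A minimal sketch of such an aggregation, assuming a simple per-question result record that is not taken from the paper, might look like this:

```python
from collections import defaultdict

def accuracy_by_condition(results):
    """Aggregate per-question correctness into accuracy per
    (reference_type, event_set_length) cell.

    `results` is assumed to be an iterable of dicts such as
    {"reference_type": "implicit", "event_set_length": 50, "correct": True};
    this record format is illustrative, not the paper's.
    """
    totals = defaultdict(lambda: [0, 0])   # cell -> [num_correct, num_total]
    for r in results:
        cell = (r["reference_type"], r["event_set_length"])
        totals[cell][0] += int(r["correct"])
        totals[cell][1] += 1
    return {cell: correct / total for cell, (correct, total) in totals.items()}
```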
Prompt Engineering Insights
Prompts based on chain-of-thought (CoT) reasoning, particularly those instructing a step-by-step approach, consistently yielded the best performance across event set lengths. By contrast, adding date information to prompts did not significantly improve accuracy, suggesting that the models fail to exploit this extra temporal detail. Presenting events as natural-language sentences rather than structured JSON also improved performance, pointing to a likely training bias toward conventional language data.
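A small sketch of how these prompt variants might be assembled is given below; the exact wording, event phrasing, and JSON layout are illustrative assumptions rather than the paper's prompts.

```python
import json

def build_prompt(events, question, use_cot=True, as_json=False):
    """Assemble an evaluation prompt from an event set and a question.

    The wording is an illustrative reconstruction of the prompt variants
    discussed (step-by-step CoT, natural language vs. JSON events), not
    the paper's exact prompt text.
    """
    if as_json:
        context = json.dumps([{"date": d, "event": e} for d, e in events], indent=2)
    else:
        context = "\n".join(f"On {d}, {e}." for d, e in events)
    instruction = ("Answer the question. Think step by step before giving "
                   "the final answer.") if use_cot else "Answer the question."
    return f"{context}\n\n{question}\n{instruction}"

events = [("2023-05-04", "Alice watered the plants"),
          ("2023-05-06", "Alice did the laundry")]
print(build_prompt(events, "What did Alice do two days before doing the laundry?"))
```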
Implications and Future Directions
The paper underscores the substantial challenges that temporal reasoning still poses, advocating enhanced memory mechanisms or explicit reasoning modules within LLM frameworks. It further suggests that automated prompt engineering tools could systematically refine model responses, addressing weaknesses in handling vague temporal queries. Extending the benchmark to broader temporal categories and more complex event reasoning scenarios is identified as another direction for future work.
In summary, TRAVELER provides a detailed, systematic approach for evaluating key dimensions of temporal reasoning in AI models, offering valuable metrics for further refinement and innovation in large language modeling.