Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
The paper "Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning," authored by a team from Google Research, Google DeepMind, and other Google divisions, addresses a crucial issue in the evaluation of LLMs. Specifically, it focuses on the limitations of existing benchmarks in assessing LLMs' capabilities in temporal reasoning, an essential aspect of AI systems operating across various domains.
Overview of Temporal Reasoning Challenges
The paper begins by acknowledging the progress in LLM research, citing significant models such as BERT and GPT-4. It then identifies a gap: existing temporal reasoning benchmarks typically depend on knowledge-graph-style temporal facts about real-world entities, so they often measure a model's capacity to recall prior knowledge rather than its genuine temporal reasoning ability.
Contributions of the Paper
To bridge this gap, the paper introduces the "Test of Time" (ToT) benchmark, which comprises synthetic datasets engineered to evaluate LLMs on a broad spectrum of temporal reasoning tasks. The datasets are designed to isolate the impact of problem structure, size, question type, and fact order on LLM performance.
The main contributions include:
- Synthetic Datasets for Comprehensive Evaluation: The paper introduces synthetic datasets for unbiased assessment, free from contamination by pre-existing real-world data. This approach ensures that the models cannot exploit parametric knowledge.
- Separation of Temporal Semantics and Arithmetic Reasoning: The ToT benchmark splits tasks into ToT-Semantic and ToT-Arithmetic. ToT-Semantic focuses on understanding the semantics and logic of time, while ToT-Arithmetic evaluates the ability to perform calculations involving time points and durations.
- Open-Sourcing of Resources: The authors have made the datasets and evaluation framework available on Hugging Face, promoting further research and replication of results (a minimal loading sketch follows this list).
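For readers who want to try the benchmark, here is a minimal loading sketch using the Hugging Face `datasets` library. The repository id and configuration name below are assumptions, not confirmed by the paper; check the dataset card for the actual values.

```python
# Minimal loading sketch. The repo id and config name are assumptions;
# consult the ToT dataset card on Hugging Face for the actual values.
from datasets import load_dataset

tot_semantic = load_dataset("baharef/ToT", "tot_semantic")  # hypothetical config name
print(tot_semantic)
```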
Detailed Analysis of Temporal Tasks
ToT-Semantic: Assessing Temporal Semantics
The ToT-Semantic dataset includes questions derived from graph structures generated by various algorithms (e.g., Erdős–Rényi, Scale-Free Networks, Barabási–Albert, Stochastic Block Model). This ensures diversity in temporal relations and question complexity. Questions cover types like "EventAtTimeT," "EventAtWhatTime," and others, probing different aspects of temporal understanding.
The generators produce structures ranging from sparse to dense, giving a comprehensive test environment. The questions require models to reason over these graphs, which lets the benchmark separate temporal reasoning ability from reliance on memorized knowledge.
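As an illustration of how such graphs can be generated, the sketch below uses networkx's standard generators. It mirrors the general technique the paper describes rather than the authors' released code, and the parameters are arbitrary choices.

```python
import networkx as nx

# Illustrative graph generation with networkx's built-in generators.
# Node counts and edge probabilities are arbitrary choices here.
n = 20
generators = {
    "ErdosRenyi": nx.erdos_renyi_graph(n, p=0.2, seed=0),
    "ScaleFree": nx.DiGraph(nx.scale_free_graph(n, seed=0)),
    "BarabasiAlbert": nx.barabasi_albert_graph(n, m=2, seed=0),
    "StochasticBlockModel": nx.stochastic_block_model(
        [n // 2, n // 2], [[0.5, 0.05], [0.05, 0.5]], seed=0
    ),
}
for name, g in generators.items():
    # Density varies widely across generators, which is exactly the
    # structural diversity the benchmark exploits.
    print(name, g.number_of_nodes(), g.number_of_edges())
```

Temporal facts (e.g., an entity relation holding from one year to another) can then be attached to the edges of each graph, and questions are generated over the resulting timeline.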
ToT-Arithmetic: Focusing on Temporal Calculations
ToT-Arithmetic is crowd-sourced and centers on practical arithmetic tasks involving time. It includes categories such as "AddSubtract," "Compare," "Duration," and "Timezone," each designed to probe a specific arithmetic skill. The generation process started from a seed set of questions, which annotators then expanded to cover a wide range of time-related calculations.
For example, a "Duration" task might ask for the difference in days between two dates, requiring precise arithmetic that challenges models on fundamental counting skills.
Experimental Results and Analysis
The paper evaluates three state-of-the-art LLMs: Claude-3-Sonnet, GPT-4, and Gemini 1.5 Pro. The findings reveal several key insights:
- Impact of Temporal Structure: The paper demonstrates that different graph structures significantly affect model performance. For instance, GPT-4's accuracy varied widely across different graph types, indicating that some structures pose more challenges than others.
- Question Type Difficulty: Different types of temporal questions pose varying levels of difficulty for LLMs. Simple retrieval questions like "EventAtWhatTime" are generally easier, while multi-fact questions like "Timeline" are considerably harder.
- Order of Facts: The order in which facts are presented to the models can significantly affect their performance. Sorting facts by target entity and then by start time (TargetAndStartTime) provided the best results (see the sketch after this list).
- Arithmetic Challenges: ToT-Arithmetic tasks revealed that while LLMs perform well on straightforward arithmetic operations, they struggle with complex time calculations involving multiple steps or intricate logic.
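To make the fact-ordering finding concrete, the sketch below shows one way to implement the TargetAndStartTime ordering. The fact schema and field names are hypothetical, invented for illustration rather than taken from the benchmark's released format.

```python
# Hypothetical fact records; field names are illustrative only.
facts = [
    {"subject": "E1", "relation": "was_at", "target": "E3", "start": 1997, "end": 2001},
    {"subject": "E2", "relation": "was_at", "target": "E3", "start": 1994, "end": 1998},
    {"subject": "E1", "relation": "was_at", "target": "E4", "start": 2001, "end": 2005},
]

# TargetAndStartTime: group facts by target entity, then order each group
# chronologically by start time before serializing them into the prompt.
ordered = sorted(facts, key=lambda f: (f["target"], f["start"]))
for f in ordered:
    print(f)
```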
Implications and Future Directions
The findings from this paper have several implications for the development and application of LLMs:
- Improvement of Temporal Reasoning: The benchmark provides a controlled environment to test and improve temporal reasoning in LLMs, a critical capability for many real-world applications.
- Training and Fine-tuning: Insights from the benchmark can guide the training and fine-tuning of LLMs, enabling better performance on time-sensitive tasks.
- Benchmark Development: The approach used in ToT can inspire the development of other benchmarks that require disentangling specific reasoning capabilities from parametric knowledge.
Conclusion
"Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning" represents a significant step towards a more nuanced understanding of LLM capabilities. By focusing on synthetic datasets and separating semantic from arithmetic reasoning, the authors provide a robust framework for future AI research. The open sourcing of their work paves the way for ongoing advancements in the field, promising improved AI systems capable of sophisticated temporal reasoning.