Temporal Commonsense Reasoning in Dialog: An Analysis of TimeDial
The paper "TimeDial: Temporal Commonsense Reasoning in Dialog" examines the temporal reasoning capabilities of LLMs (LMs) when faced with dialog scenarios, a context that has largely remained unexplored despite gaining significant attention with advancements in pre-trained LMs like T5 and GPT-3. The authors propose a unique task, establishing a new dataset called TimeDial to examine the efficacy of these models in understanding temporal commonsense within multi-turn dialogs. The paper involves assessing models based on their ability to infer temporal dynamics in conversations, by leveraging over 1.1K curated dialog scenarios.
Task Formulation and Dataset
TimeDial is formulated as a multiple-choice cloze task over scenarios that require temporal reasoning, built from a collection of over 1,100 dialogs drawn from the DailyDialog corpus. The authors frame the challenge with an emphasis on understanding both the explicit and the nuanced temporal implications embedded in dialog exchanges.
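To make the task format concrete, here is a minimal sketch of what a TimeDial-style instance might look like in Python; the field names and example values are illustrative assumptions, not the dataset's actual schema.

```python
# A hypothetical TimeDial-style cloze instance. Field names and values
# are illustrative only, not the dataset's real schema.
instance = {
    "dialog": [
        "A: How long do you want to lease the apartment for?",
        "B: I was thinking of <MASK>.",
    ],
    # Candidate fillers for the masked temporal span; the model must
    # distinguish temporally plausible options from distractors.
    "options": [
        "one year",      # plausible given the context
        "six months",    # plausible given the context
        "ten seconds",   # temporally implausible distractor
        "one century",   # temporally implausible distractor
    ],
}
```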
The dataset is meticulously curated to contain dialogs rich in temporal expressions, where selecting the contextually correct response requires both commonsense and temporal arithmetic. The construction of TimeDial also includes strategically crafted negative examples designed to trip up models that rely on superficial cues and to enforce genuine temporal understanding.
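As an illustration of how such negatives might be constructed, the sketch below implements two simple heuristics consistent with the description above: perturbing the numeral in a correct answer, and reusing a temporal phrase that appears elsewhere in the dialog. This is a simplification for exposition, not the authors' exact procedure.

```python
import random
import re

def make_distractor(correct_span: str, context_spans: list) -> str:
    """Sketch of two distractor heuristics (not the authors' exact method)."""
    if re.search(r"\d+", correct_span) and random.random() < 0.5:
        # Heuristic 1: scale the numeral so the duration no longer fits,
        # e.g. "2 weeks" -> "200 weeks".
        return re.sub(r"\d+", lambda m: str(int(m.group()) * 100), correct_span)
    # Heuristic 2: reuse a temporal phrase from elsewhere in the dialog; it
    # overlaps lexically with the context but is wrong for this slot.
    return random.choice(context_spans)
```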
Experimental Framework
The paper evaluates state-of-the-art LMs under three common paradigms: binary classification, mask filling, and generation. Each paradigm treats the dialog as context paired with a completion task, essentially predicting the masked temporal information from that context.
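A minimal sketch of the mask-filling/generation setup follows, using Hugging Face's T5 as a stand-in (the paper evaluates T5, though the exact scoring procedure here is an assumption): each candidate is scored by the negative loss T5 assigns to generating it as the masked span.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").eval()

def candidate_score(dialog_with_mask: str, candidate: str) -> float:
    # T5 marks masked spans with sentinel tokens such as <extra_id_0>;
    # we score each candidate by the (negated) loss of generating it
    # as the span's content. Higher scores mean better fits.
    inputs = tokenizer(dialog_with_mask, return_tensors="pt")
    labels = tokenizer(f"<extra_id_0> {candidate}", return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean token NLL
    return -loss.item()

context = "A: How long is the lease? B: It runs for <extra_id_0>, I believe."
candidates = ["two years", "two minutes"]
best = max(candidates, key=lambda c: candidate_score(context, c))
```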
To probe these models further, the authors fine-tune them on both in-domain (DailyDialog) and out-of-domain (Meena) data. The goal is to determine how domain relevance and data volume affect the models' ability to generalize temporal reasoning.
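The fine-tuning comparison might be organized along the following lines; `load_cloze_dataset`, the corpus names, and the training configuration are hypothetical placeholders rather than the paper's actual setup.

```python
from transformers import Trainer, TrainingArguments

# Compare in-domain vs. out-of-domain fine-tuning. The corpus names and
# the load_cloze_dataset helper are hypothetical placeholders.
for corpus in ["dailydialog_cloze", "meena_cloze"]:
    train_set = load_cloze_dataset(corpus)  # hypothetical data loader
    trainer = Trainer(
        model=model,  # e.g., the T5 model from the previous sketch
        args=TrainingArguments(output_dir=f"ckpt-{corpus}", num_train_epochs=3),
        train_dataset=train_set,
    )
    trainer.train()
    # Evaluating each checkpoint on TimeDial isolates the effect of
    # domain relevance on temporal reasoning performance.
```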
Findings
The experiments indicate that even the most proficient models struggle significantly, leaving a wide gap relative to human annotators: humans achieve a near-perfect accuracy of 97.8%, whereas the best-performing model, T5-large fine-tuned on in-domain data, lags behind at 74.8%.
One vital observation is that LMs are susceptible to distractors whose lexical or numerical content closely matches that of the contextually correct answers, suggesting an over-reliance on shallow pattern matching rather than robust temporal reasoning.
Implications and Future Directions
These findings indicate that current LMs have not fully internalized temporal commonsense reasoning in dialog, despite their proficiency on related tasks. This invites further research toward models that capture a deeper understanding of how time operates in conversational contexts.
TimeDial also opens avenues for adjusting model architectures to handle temporal dynamics more robustly, and suggests integrating more diverse forms of world knowledge into these systems to strengthen their temporal reasoning.
Conclusion
The research presented in TimeDial demonstrates significant limitations of state-of-the-art LMs in temporal reasoning during dialog interactions. The dataset serves as a pivotal resource for the community in developing AI systems that better understand temporal relationships in human-like conversation. Future efforts may focus on richer data representations and improved training strategies to bridge the discernible performance gap between machines and humans on temporal reasoning tasks.