
Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models (2306.08952v2)

Published 15 Jun 2023 in cs.CL and cs.AI

Abstract: Reasoning about time is of fundamental importance. Many facts are time-dependent: for example, athletes change teams from time to time, and different government officials are elected periodically. Previous time-dependent question answering (QA) datasets tend to be biased in their coverage of either time spans or question types. In this paper, we introduce TempReason, a comprehensive probing dataset to evaluate the temporal reasoning capability of large language models (LLMs). Our dataset includes questions at three levels of temporal reasoning. In addition, we propose a novel learning framework to improve the temporal reasoning capability of LLMs, based on temporal span extraction and time-sensitive reinforcement learning. We conducted experiments in closed-book QA, open-book QA, and reasoning QA settings and demonstrated the effectiveness of our approach. Our code and data are released at https://github.com/DAMO-NLP-SG/TempReason.
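The abstract names the framework's two ingredients, temporal span extraction and time-sensitive reinforcement learning, without further detail. As a minimal sketch of the second idea only, the Python below assumes a reward that penalizes answers which are true of the subject at a different time more heavily than arbitrary wrong answers; the function name, reward values, and string-matching logic are hypothetical illustrations, not the paper's implementation.

    # Hypothetical time-sensitive reward for RL fine-tuning (illustration
    # only, not the paper's method). Answers that were valid for the same
    # subject at a *different* time are penalized hardest, pushing the
    # policy to attend to the time constraint in the question.

    def time_sensitive_reward(prediction: str, gold: str,
                              temporal_negatives: set[str]) -> float:
        """Score one generated answer to a time-scoped question.

        prediction         -- the model's answer
        gold               -- the answer valid at the queried time
        temporal_negatives -- answers valid for the subject at other times
        """
        pred = prediction.strip().lower()
        if pred == gold.strip().lower():
            return 1.0   # correct at the queried time
        if pred in {n.strip().lower() for n in temporal_negatives}:
            return -1.0  # right subject, wrong time: strongest penalty
        return -0.5      # generic wrong answer

    # Example: "Which club did player X play for in 2015?"
    print(time_sensitive_reward("Paris Saint-Germain", "FC Barcelona",
                                {"Paris Saint-Germain", "Inter Miami"}))  # -1.0

In a policy-gradient setup such as PPO, a scalar like this would serve as the sequence-level reward in place of a pure exact-match signal.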

Authors (3)
  1. Qingyu Tan (9 papers)
  2. Hwee Tou Ng (44 papers)
  3. Lidong Bing (144 papers)
Citations (20)