MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models (2310.05157v1)

Published 8 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs have shown nearly saturated performance on many NLP tasks. As a result, it is natural to believe that LLMs have also mastered abilities such as time understanding and reasoning. However, research on the temporal sensitivity of LLMs has received insufficient attention. To fill this gap, this paper constructs Multiple Sensitive Factors Time QA (MenatQA), which encompasses three temporal factors (scope factor, order factor, counterfactual factor) with a total of 2,853 samples for evaluating the time comprehension and reasoning abilities of LLMs. This paper tests current mainstream LLMs with different parameter sizes, ranging from billions to hundreds of billions. The results show that most LLMs fall behind smaller temporal reasoning models to varying degrees on these factors. Specifically, LLMs show a significant vulnerability to temporal biases and depend heavily on the temporal information provided in questions. Furthermore, this paper undertakes a preliminary investigation into potential improvement strategies by devising specific prompts and leveraging external tools. These approaches serve as valuable baselines or references for future research.
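
As a rough illustration of the kind of sensitivity probe the abstract describes (not the authors' actual evaluation code), the Python sketch below perturbs the time scope in a question and checks whether a model's answer stays consistent. The passage, questions, PROMPT_TEMPLATE, ask_model callable, and probe_scope_factor function are hypothetical placeholders; MenatQA's real prompts and data format may differ.

    # Minimal sketch (assumed, not from the paper): probing scope-factor sensitivity.
    from typing import Callable

    PROMPT_TEMPLATE = (
        "Answer the question based only on the passage.\n"
        "Passage: {passage}\n"
        "Question: {question}\n"
        "Answer:"
    )

    def probe_scope_factor(ask_model: Callable[[str], str],
                           passage: str,
                           question: str,
                           perturbed_question: str) -> dict:
        # Ask about the same fact twice: once with the original time scope and
        # once with a narrowed scope, then check whether the answer changes.
        original = ask_model(PROMPT_TEMPLATE.format(passage=passage, question=question))
        perturbed = ask_model(PROMPT_TEMPLATE.format(passage=passage, question=perturbed_question))
        return {
            "original": original,
            "perturbed": perturbed,
            "consistent": original.strip() == perturbed.strip(),
        }

    if __name__ == "__main__":
        # Toy passage and questions, purely illustrative.
        passage = ("Alice served as CEO of Acme from 2010 to 2015, "
                   "and joined Beta in 2016.")
        fake_llm = lambda prompt: "Acme"  # stand-in for a real LLM call
        print(probe_scope_factor(
            fake_llm,
            passage,
            "Which company did Alice lead between 2010 and 2015?",
            "Which company did Alice lead between 2012 and 2014?",
        ))

A consistent answer across the two scopes suggests the model is actually grounding on the stated time span rather than echoing surface cues; the order and counterfactual factors could be probed with analogous perturbations.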

Authors (8)
  1. Yifan Wei (20 papers)
  2. Yisong Su (3 papers)
  3. Huanhuan Ma (10 papers)
  4. Xiaoyan Yu (22 papers)
  5. Fangyu Lei (19 papers)
  6. Yuanzhe Zhang (20 papers)
  7. Jun Zhao (469 papers)
  8. Kang Liu (207 papers)
Citations (6)