Formulation Comparison for Timeline Construction using LLMs (2403.00990v1)

Published 1 Mar 2024 in cs.CL

Abstract: Constructing a timeline requires identifying the chronological order of events in an article. In prior timeline construction datasets, temporal orders are typically annotated by either event-to-time anchoring or event-to-event pairwise ordering, both of which suffer from missing temporal information. To mitigate this issue, we develop a new evaluation dataset, TimeSET, consisting of single-document timelines with document-level order annotation. TimeSET features saliency-based event selection and partial ordering, which enable a practical annotation workload. Aiming to build better automatic timeline construction systems, we propose a novel evaluation framework to compare multiple task formulations with TimeSET by prompting open LLMs, i.e., Llama 2 and Flan-T5. Considering that identifying temporal orders of events is a core subtask in timeline construction, we further benchmark open LLMs on existing event temporal ordering datasets to gain a robust understanding of their capabilities. Our experiments show that (1) the NLI formulation with Flan-T5 demonstrates strong performance among the formulations compared, while (2) timeline construction and event temporal ordering remain challenging tasks for few-shot LLMs. Our code and data are available at https://github.com/kimihiroh/timeset.
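The abstract describes timelines annotated as partial orders over salient events, which a system's predicted ordering can then be scored against. As an illustrative sketch only (the function names, the toy events, and the agreement metric below are assumptions, not the paper's actual evaluation code), a single-document timeline can be modeled as a DAG of "before" edges, expanded to its transitive closure, and compared pairwise against a gold partial order:

```python
from itertools import combinations

def transitive_closure(events, edges):
    """Expand direct (a, b) 'a before b' edges into all implied ordered pairs."""
    before = {e: {b for a, b in edges if a == e} for e in events}
    changed = True
    while changed:
        changed = False
        for e in events:
            implied = set().union(*(before[b] for b in before[e])) if before[e] else set()
            if not implied <= before[e]:
                before[e] |= implied
                changed = True
    return {(a, b) for a in events for b in before[a]}

def pairwise_agreement(events, gold_edges, pred_edges):
    """Fraction of event pairs whose relation (before / after / unordered)
    matches between the gold and predicted partial orders."""
    gold = transitive_closure(events, gold_edges)
    pred = transitive_closure(events, pred_edges)

    def rel(order, a, b):
        if (a, b) in order:
            return "before"
        if (b, a) in order:
            return "after"
        return "unordered"  # partial orders may leave pairs uncompared

    pairs = list(combinations(sorted(events), 2))
    correct = sum(rel(gold, a, b) == rel(pred, a, b) for a, b in pairs)
    return correct / len(pairs)

# Toy example: a four-event news timeline.
events = {"negotiate", "announce", "sign", "celebrate"}
gold = [("negotiate", "announce"), ("announce", "sign"), ("sign", "celebrate")]
pred = [("negotiate", "announce"), ("announce", "sign")]  # misses one edge

print(pairwise_agreement(events, gold, pred))  # → 0.5
```

A prediction missing one chain edge loses credit on every pair that edge's transitive closure would have ordered, which is why the toy prediction above scores 0.5 rather than 5/6. This is a simplified stand-in for the paper's metric, whose exact definition is in the linked repository.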

