A Comprehensive Evaluation on Event Reasoning of Large Language Models (2404.17513v2)
Abstract: Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and must handle the diversity of inter-event relations and reasoning paradigms. How well LLMs perform event reasoning across these relations and paradigms remains unknown. To fill this gap, we comprehensively evaluate the event reasoning abilities of LLMs. We introduce EV2, a novel benchmark for EValuation of EVent reasoning. EV2 evaluates at two levels, schema and instance, and is comprehensive in its coverage of relations and reasoning paradigms. We conduct extensive experiments on EV2 and find that LLMs can perform event reasoning, but their performance is far from satisfactory. We also observe an imbalance in the event reasoning abilities of LLMs. In addition, LLMs possess event schema knowledge, yet they are not aligned with humans in how they utilize it. Based on these findings, we guide LLMs to utilize event schema knowledge as memory, which leads to improvements in event reasoning.
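The abstract's final point, guiding LLMs to use event schema knowledge as memory, can be pictured as a recall-then-answer prompting pipeline. The sketch below is a hypothetical illustration, not the paper's implementation: `query_llm`, `recall_schema_knowledge`, `answer_with_schema_memory`, and the prompt wording are all assumed names standing in for whatever model client and prompts one actually uses.

```python
# Minimal sketch (assumed, not the paper's method) of supplying event schema
# knowledge as "memory" before answering an event reasoning question.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to any LLM API; returns a canned string here."""
    return f"[LLM response to: {prompt[:40]}...]"

def recall_schema_knowledge(question: str) -> str:
    """Step 1: elicit event schema knowledge relevant to the question."""
    prompt = (
        "Describe the typical event schema (participants, preconditions, "
        f"causes, and effects) relevant to this question:\n{question}"
    )
    return query_llm(prompt)

def answer_with_schema_memory(question: str) -> str:
    """Step 2: answer with the recalled schema prepended as memory."""
    schema = recall_schema_knowledge(question)
    prompt = (
        f"Event schema knowledge (memory):\n{schema}\n\n"
        f"Using this knowledge, answer the event reasoning question:\n{question}"
    )
    return query_llm(prompt)

if __name__ == "__main__":
    print(answer_with_schema_memory(
        "What usually happens after a restaurant order is placed?"
    ))
```

The design choice being illustrated is simply that the schema knowledge is made explicit in the context window before the instance-level question is posed, rather than relying on the model to invoke that knowledge implicitly.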