"I'd Like to Have an Argument, Please": Argumentative Reasoning in Large Language Models (2309.16938v2)
Abstract: We evaluate the ability of two LLMs to perform argumentative reasoning. We experiment with argument mining (AM) and argument pair extraction (APE), and evaluate the LLMs' ability to recognize arguments under progressively more abstract input and output (I/O) representations (e.g., arbitrary label sets, graphs). Unlike the well-known evaluation of prompt phrasings, abstraction evaluation retains the prompt's phrasing while testing reasoning capabilities. We find that, score-wise, the LLMs match or surpass the state of the art in AM and APE, and that under certain I/O abstractions they perform well, even beating chain-of-thought; we call this symbolic prompting. However, statistical analysis of the LLMs' outputs under small, yet still human-readable, alterations in the I/O representations (e.g., asking for BIO tags as opposed to line numbers) shows that the models are not actually performing reasoning. This suggests that LLM applications to some tasks, such as data labelling and paper reviewing, must be done with care.
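To make the "small alterations in I/O representations" concrete: the abstract contrasts asking the model for per-line BIO tags versus line-number spans, two representations that carry exactly the same segmentation information. The sketch below (a hypothetical helper, not code from the paper) converts one into the other, illustrating that a reasoner indifferent to surface form should treat them equivalently.

```python
def bio_to_spans(tags):
    """Convert a per-line BIO tag sequence into (start, end) line-number spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":            # a new argument component begins here
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == "O":          # this line is outside any component
            if start is not None:
                spans.append((start, i - 1))
                start = None
        # tag == "I": the current component continues; nothing to do
    if start is not None:         # close a component that runs to the end
        spans.append((start, len(tags) - 1))
    return spans

# Lines 0-1 and 3-5 are argument components; lines 2 and 6 are not.
tags = ["B", "I", "O", "B", "I", "I", "O"]
print(bio_to_spans(tags))  # [(0, 1), (3, 5)]
```

Since the two formats are mechanically interconvertible, a model whose scores shift significantly when the requested output format changes is reacting to the representation rather than reasoning over the argument structure, which is the paper's diagnostic.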
- Adrian de Wynter
- Tangming Yuan