Evaluation of an LLM in Identifying Logical Fallacies: A Call for Rigor When Adopting LLMs in HCI Research (2404.05213v1)

Published 8 Apr 2024 in cs.HC and cs.AI

Abstract: There is increasing interest in the adoption of LLMs in HCI research. However, LLMs may often be regarded as a panacea because of their powerful capabilities with an accompanying oversight on whether they are suitable for their intended tasks. We contend that LLMs should be adopted in a critical manner following rigorous evaluation. Accordingly, we present the evaluation of an LLM in identifying logical fallacies that will form part of a digital misinformation intervention. By comparing to a labeled dataset, we found that GPT-4 achieves an accuracy of 0.79, and for our intended use case that excludes invalid or unidentified instances, an accuracy of 0.90. This gives us the confidence to proceed with the application of the LLM while keeping in mind the areas where it still falls short. The paper describes our evaluation approach, results and reflections on the use of the LLM for our intended task.


Summary

  • The paper presents a methodological framework for evaluating GPT-4's fallacy detection, reporting an accuracy of 0.79 overall and 0.90 once invalid or unidentified instances are excluded.
  • It emphasizes rigorous evaluation in HCI research to ensure LLMs address digital misinformation effectively.
  • The findings urge researchers to balance the strengths of LLMs with critical assessment to mitigate limitations in logical reasoning detection.

The paper "Evaluation of an LLM in Identifying Logical Fallacies: A Call for Rigor When Adopting LLMs in HCI Research" explores the application of a LLM in identifying logical fallacies within the context of Human-Computer Interaction (HCI) research. Given the burgeoning interest in integrating LLMs into various research domains, including HCI, the authors advocate for a critical and rigorous evaluation approach to ensure that LLMs are fit for their intended tasks.

The primary focus of the paper is to assess the effectiveness of GPT-4 in identifying logical fallacies, which are common errors in reasoning that can undermine the validity of an argument. The ability to detect these fallacies is particularly pertinent as part of a broader strategy to combat digital misinformation, a significant issue in today's information landscape.
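
As a concrete illustration of what such a classification step might look like, the following minimal Python sketch prompts GPT-4 for a single fallacy label via the OpenAI chat API. The label set, prompt wording, and helper name are illustrative assumptions, not the prompt or taxonomy used in the paper.

    # Minimal sketch: ask GPT-4 to assign one fallacy label to a statement.
    # The label set and prompt wording are illustrative, not the paper's.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    FALLACY_LABELS = [
        "ad hominem", "strawman", "false dilemma", "slippery slope",
        "appeal to authority", "hasty generalization", "no fallacy",
    ]

    def label_fallacy(statement: str) -> str:
        """Return the single label the model picks for the statement."""
        prompt = (
            "Classify the logical fallacy in the following statement. "
            f"Answer with exactly one of: {', '.join(FALLACY_LABELS)}.\n\n"
            f"Statement: {statement}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep outputs stable across evaluation runs
        )
        return response.choices[0].message.content.strip().lower()

    print(label_fallacy("Everyone I know recovered after taking it, so it must work."))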

The researchers developed a methodological framework for evaluating GPT-4's performance, using a labeled dataset as the benchmark. In their analysis, GPT-4 achieved an accuracy of 0.79 in identifying logical fallacies. For the intended use case, which excludes instances that were invalid or unidentified, the accuracy rose to 0.90. These findings indicate that while GPT-4 shows promise, it has limitations that warrant caution.
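
The two accuracy figures can be reproduced in a few lines once the model's outputs are paired with the gold labels. The sketch below assumes each instance is a (gold_label, predicted_label) pair and that responses treated as invalid or unidentified are marked with sentinel strings; these names are illustrative, not the paper's exact bookkeeping.

    # Overall accuracy vs. accuracy restricted to usable predictions.
    # The sentinel values "invalid" / "unidentified" are assumptions for illustration.
    def accuracies(pairs, excluded=("invalid", "unidentified")):
        overall = sum(1 for gold, pred in pairs if pred == gold) / len(pairs)
        usable = [(gold, pred) for gold, pred in pairs if pred not in excluded]
        restricted = sum(1 for gold, pred in usable if pred == gold) / len(usable)
        return overall, restricted

    # Toy data: 2 of 4 correct overall; 2 of 3 correct once the
    # "unidentified" response is excluded.
    pairs = [
        ("ad hominem", "ad hominem"),
        ("strawman", "unidentified"),
        ("false dilemma", "false dilemma"),
        ("slippery slope", "strawman"),
    ]
    print(accuracies(pairs))  # (0.5, 0.666...)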

Significantly, the paper underscores the importance of not treating LLMs as all-capable simply because of their impressive general performance. The authors reflect on the areas where GPT-4 performs well and where it falls short, arguing that researchers and practitioners must maintain a critical perspective. This balanced view aims to harness the strengths of LLMs while remaining mindful of their imperfections.

In conclusion, this paper contributes to the discourse on the responsible adoption of LLMs in HCI research, urging the community to prioritize rigorous evaluation to ensure these powerful tools are leveraged effectively and appropriately. The findings provide a foundation for proceeding with the use of LLMs in tasks such as identifying logical fallacies, all the while acknowledging and addressing their constraints.
