Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

LLMs for Generating and Evaluating Counterfactuals: A Comprehensive Study (2405.00722v2)

Published 26 Apr 2024 in cs.CL and cs.AI

Abstract: As NLP models become more complex, understanding their decisions becomes more crucial. Counterfactuals (CFs), where minimal changes to inputs flip a model's prediction, offer a way to explain these models. While LLMs have shown remarkable performance in NLP tasks, their efficacy in generating high-quality CFs remains uncertain. This work fills this gap by investigating how well LLMs generate CFs for two NLU tasks. We conduct a comprehensive comparison of several common LLMs, and evaluate their CFs, assessing both intrinsic metrics, and the impact of these CFs on data augmentation. Moreover, we analyze differences between human and LLM-generated CFs, providing insights for future research directions. Our results show that LLMs generate fluent CFs, but struggle to keep the induced changes minimal. Generating CFs for Sentiment Analysis (SA) is less challenging than NLI where LLMs show weaknesses in generating CFs that flip the original label. This also reflects on the data augmentation performance, where we observe a large gap between augmenting with human and LLMs CFs. Furthermore, we evaluate LLMs' ability to assess CFs in a mislabelled data setting, and show that they have a strong bias towards agreeing with the provided labels. GPT4 is more robust against this bias and its scores correlate well with automatic metrics. Our findings reveal several limitations and point to potential future work directions.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (37)
  1. Improving classifier robustness through active generative counterfactual data augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 127–139, Singapore. Association for Computational Linguistics.
  2. Towards llm-guided causal explainability for black-box text classifiers.
  3. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  4. DISCO: Distilling counterfactuals with large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5514–5528, Toronto, Canada. Association for Computational Linguistics.
  5. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  6. Data augmentation using llms: Data perspectives, learning paradigms and challenges. arXiv preprint arXiv:2403.02990.
  7. A survey on in-context learning. arXiv preprint arXiv:2301.00234.
  8. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.
  9. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.
  10. Hallucinations in Large Multilingual Translation Models. Transactions of the Association for Computational Linguistics, 11:1500–1517.
  11. Training compute-optimal large language models.
  12. Mistral 7b.
  13. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.
  14. Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193–203, Tampere, Finland. European Association for Machine Translation.
  15. Prompting large language models for counterfactual generation: An empirical study.
  16. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.
  17. LLM comparative assessment: Zero-shot NLG evaluation through pairwise comparisons using large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 139–151, St. Julian’s, Malta. Association for Computational Linguistics.
  18. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
  19. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, pages 46534–46594. Curran Associates, Inc.
  20. Benchmarking large language model capabilities for conditional generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9194–9213, Toronto, Canada. Association for Computational Linguistics.
  21. Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.
  22. Christoph Molnar. 2022. Interpretable Machine Learning, 2 edition.
  23. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126, Online. Association for Computational Linguistics.
  24. Llm evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076.
  25. Language Models are Unsupervised Multitask Learners.
  26. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  27. Generating realistic natural language counterfactuals. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3611–3625, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  28. Explaining NLP models via minimal contrastive editing (MiCE). In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3840–3852, Online. Association for Computational Linguistics.
  29. CATfOOD: Counterfactual augmented training for improving out-of-domain performance and calibration. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1876–1898, St. Julian’s, Malta. Association for Computational Linguistics.
  30. How does counterfactually augmented data impact models for social computing constructs? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 325–344, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  31. Llama 2: Open foundation and fine-tuned chat models.
  32. CREST: A joint framework for rationalization and counterfactual text generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15109–15126, Toronto, Canada. Association for Computational Linguistics.
  33. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  34. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  35. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6707–6723, Online. Association for Computational Linguistics.
  36. A survey of large language models. arXiv preprint arXiv:2303.18223.
  37. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623. Curran Associates, Inc.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Van Bach Nguyen (7 papers)
  2. Paul Youssef (13 papers)
  3. Jörg Schlötterer (35 papers)
  4. Christin Seifert (46 papers)
Citations (7)
X Twitter Logo Streamline Icon: https://streamlinehq.com