People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection
Abstract: NLP models are used in a variety of critical social computing tasks, such as detecting sexist, racist, or otherwise hateful content. Therefore, it is imperative that these models are robust to spurious features. Past work has attempted to tackle such spurious features using training data augmentation, including Counterfactually Augmented Data (CADs). CADs introduce minimal changes to existing training data points and flip their labels; training on them may reduce model dependency on spurious features. However, manually generating CADs can be time-consuming and expensive. Hence, in this work, we assess whether this task can be automated using generative NLP models. We automatically generate CADs using Polyjuice, ChatGPT, and Flan-T5, and evaluate their usefulness in improving model robustness compared to manually generated CADs. Testing both model performance on multiple out-of-domain test sets and individual data point efficacy, we find that while manual CADs are still the most effective, CADs generated by ChatGPT come a close second. One key reason for the lower performance of automated methods is that the changes they introduce are often insufficient to flip the original label.
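To make the setup concrete, below is a minimal sketch of how a counterfactual edit could be requested from one of the generators named above (Flan-T5, via the Hugging Face transformers library). The checkpoint name, prompt wording, and the `generate_cad` helper are illustrative assumptions rather than the authors' exact pipeline or prompt; and, as the abstract notes, such automated edits often fail to actually flip the label, so the output would still need a human or classifier-based label check before being used as training data.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline): generate a
# candidate counterfactual edit for a sexist post with Flan-T5.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-large"  # assumed checkpoint; the paper's model size may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_cad(text: str) -> str:
    """Ask the model for a minimal edit intended to flip the label from sexist to non-sexist."""
    prompt = (
        "Rewrite the following post with as few changes as possible so that it is "
        f"no longer sexist:\n\n{text}\n\nRewritten post:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

original = "Women are too emotional to be good leaders."
candidate = generate_cad(original)
print(candidate)  # candidate CAD; verify the label actually flipped before training on it
```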
- Can we trust the evaluation on ChatGPT? ArXiv preprint, abs/2303.12767.
- Revisiting contextual toxicity detection in conversations. ACM Journal of Data and Information Quality, 15(1):1–22.
- Faithfulness Tests for Natural Language Explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 283–294, Toronto, Canada. Association for Computational Linguistics.
- Fact checking with insufficient evidence. Transactions of the Association for Computational Linguistics, 10:746–763.
- Factuality Challenges in the Era of Large Language Models.
- SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
- Adversarial filters of dataset biases. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1078–1088. PMLR.
- Speak, memory: An archaeology of books known to ChatGPT/GPT-4. ArXiv preprint, abs/2305.00118.
- Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4299–4307.
- Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416.
- AugGPT: Leveraging ChatGPT for text data augmentation.
- Thomas G Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923.
- Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.
- Understanding dataset difficulty with v-usable information. In International Conference on Machine Learning, pages 5988–6008. PMLR.
- A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.
- Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.
- An expert annotated dataset for the detection of online misogyny. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1336–1350, Online. Association for Computational Linguistics.
- NeuroCounterfactuals: Beyond minimal-edit counterfactuals for richer data augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5056–5072, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177.
- Evaluating the effectiveness of deplatforming as a moderation strategy on Twitter. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2):1–30.
- An investigation of the (in)effectiveness of counterfactually augmented data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3668–3681, Dublin, Ireland. Association for Computational Linguistics.
- Exploiting asymmetry for synthetic training data generation: SynthIE and the case of information extraction. ArXiv preprint, abs/2303.04132.
- FastText.zip: Compressing text classification models. ArXiv preprint, abs/1612.03651.
- Learning the difference that makes a difference with counterfactually-augmented data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- Fereshte Khani and Percy Liang. 2021. Removing spurious features can hurt accuracy and affect groups disproportionately. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 196–205.
- SemEval-2023 task 10: Explainable detection of online sexism. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pages 2193–2210, Toronto, Canada. Association for Computational Linguistics.
- Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1885–1894. PMLR.
- A new generation of Perspective API: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3197–3207.
- Linguistically-informed transformations (LIT): A method for automatically generating contrast sets. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 126–135, Online. Association for Computational Linguistics.
- RoBERTa: A robustly optimized BERT pretraining approach. ArXiv preprint, abs/1907.11692.
- Generate your counterfactuals: Towards controlled counterfactual generation for text. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15):13516–13524.
- Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German. In Forum for Information Retrieval Evaluation, pages 29–32.
- Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282.
- Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland. Association for Computational Linguistics.
- Is a prompt and a few samples all you need? Using GPT-4 for data augmentation in low-resource classification tasks. ArXiv preprint, abs/2304.13861.
- Unintended bias in misogyny detection. In IEEE/WIC/ACM International Conference on Web Intelligence, WI ’19, page 149–155, New York, NY, USA. Association for Computing Machinery.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Raiders of the lost kek: 3.5 years of augmented 4chan posts from the Politically Incorrect board. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 885–894.
- Judea Pearl. 2014. Interpretation and identification of causal mediation. Psychological methods, 19(4):459.
- Combining feature and instance attribution to detect artifacts. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1934–1946, Dublin, Ireland. Association for Computational Linguistics.
- Automatic prompt optimization with "gradient descent" and beam search. ArXiv preprint, abs/2305.03495.
- A benchmark dataset for learning to intervene in online hate speech. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4755–4764, Hong Kong, China. Association for Computational Linguistics.
- Is ChatGPT a general-purpose natural language processing task solver? ArXiv preprint, abs/2302.06476.
- Language models are unsupervised multitask learners.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Quick, community-specific learning: How distinctive toxicity norms are maintained in political subreddits. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 557–568.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- The evolution of the manosphere across the web. In Proceedings of the International AAAI Conference on Web and Social Media, volume 15, pages 196–207.
- Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
- Generating realistic natural language counterfactuals. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3611–3625, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Overview of EXIST 2021: Sexism identification in social networks. Procesamiento del Lenguaje Natural, 67(0).
- Explaining NLP models via minimal contrastive editing (MiCE). In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3840–3852, Online. Association for Computational Linguistics.
- Tailor: Generating and perturbing text with semantic controls. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3194–3213, Dublin, Ireland. Association for Computational Linguistics.
- HateCheck: Functional tests for hate speech detection models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 41–58, Online. Association for Computational Linguistics.
- Spillover of antisocial behavior from fringe platforms: The unintended consequences of community banning. In Proceedings of the International AAAI Conference on Web and Social Media, volume 17, pages 742–753.
- "call me sexist, but…" : Revisiting sexism detection using psychological scales and adversarial samples. Proceedings of the International AAAI Conference on Web and Social Media, 15(1):573–584.
- How does counterfactually augmented data impact models for social computing constructs? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 325–344, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Counterfactually augmented data and unintended bias: The case of sexism and hate speech detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4716–4726, Seattle, United States. Association for Computational Linguistics.
- Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293, Online. Association for Computational Linguistics.
- Generating faithful synthetic data with large language models: A case study in computational social science. ArXiv preprint, abs/2305.15041.
- Learning from the worst: Dynamically generated datasets to improve online hate detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1667–1682, Online. Association for Computational Linguistics.
- How far can camels go? Exploring the state of instruction tuning on open resources. ArXiv preprint, abs/2306.04751.
- Self-Instruct: Aligning language models with self-generated instructions. ArXiv preprint, abs/2212.10560.
- Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
- Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6707–6723, Online. Association for Computational Linguistics.
- A theory of usable information under computational constraints. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2225–2239, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Can large language models transform computational social science? ArXiv preprint, abs/2305.03514.