
Eagle: Ethical Dataset Given from Real Interactions

Published 22 Feb 2024 in cs.CL (arXiv:2402.14258v1)

Abstract: Recent studies have demonstrated that LLMs have ethics-related problems such as social biases, a lack of moral reasoning, and the generation of offensive content. Existing evaluation metrics and mitigation methods for these ethical challenges rely on datasets created by instructing humans to write instances that contain ethical problems. As a result, the data does not reflect the prompts that users actually provide when using LLM services in everyday contexts, and may not lead to the development of safe LLMs that can address ethical challenges arising in real-world applications. In this paper, we create the Eagle datasets, extracted from real interactions between ChatGPT and users, which exhibit social biases, toxicity, and immorality. Our experiments show that Eagle captures complementary aspects not covered by existing datasets proposed for evaluating and mitigating such ethical challenges. Our code is publicly available at https://huggingface.co/datasets/MasahiroKaneko/eagle.
