
OffensiveLang: A Community Based Implicit Offensive Language Dataset (2403.02472v8)

Published 4 Mar 2024 in cs.CL

Abstract: The widespread presence of hateful language on social media has adversely affected societal well-being, making it important to address this issue with high priority. Hate speech and offensive language exist in both explicit and implicit forms, with the latter being more challenging to detect. Current research in this domain faces several challenges. First, existing datasets primarily rely on collecting texts that contain explicit offensive keywords, which makes it difficult to capture implicitly offensive content that lacks these keywords. Second, common methodologies tend to focus solely on textual analysis, neglecting the valuable insights that community information can provide. In this paper, we introduce OffensiveLang, a novel community-based implicit offensive language dataset generated by ChatGPT 3.5 and containing data for 38 different target groups. Although ethical constraints limit ChatGPT's ability to generate offensive text, we present a prompt-based approach that effectively generates implicit offensive language. To ensure data quality, we evaluate the dataset with human annotators. Additionally, we employ a prompt-based zero-shot method with ChatGPT and compare the detection results between human annotation and ChatGPT annotation. We also use existing state-of-the-art models to assess how effectively they detect such language. The dataset is available here: https://github.com/AmitDasRup123/OffensiveLang
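
The comparison between human annotation and ChatGPT annotation mentioned above can be illustrated with a small agreement calculation. This is a minimal sketch only: the abstract does not say which agreement measure the authors use, so Cohen's kappa is an assumption here, and the label names and the two toy label lists are hypothetical placeholders rather than data from OffensiveLang.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences.

    Kappa corrects raw agreement for the agreement expected by chance,
    given each annotator's own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from the product of per-annotator label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (not taken from the dataset).
human_labels = ["offensive", "offensive", "not_offensive",
                "offensive", "not_offensive", "not_offensive"]
chatgpt_labels = ["offensive", "not_offensive", "not_offensive",
                  "offensive", "not_offensive", "offensive"]

kappa = cohen_kappa(human_labels, chatgpt_labels)
print(f"Cohen's kappa: {kappa:.3f}")
```

On these toy lists the annotators agree on 4 of 6 items and chance agreement is 0.5, so kappa comes out to roughly 0.333. Raw percent agreement could be reported alongside it, since kappa can look low when one label dominates.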

Authors (13)
  1. Amit Das (28 papers)
  2. Mostafa Rahgouy (6 papers)
  3. Dongji Feng (11 papers)
  4. Zheng Zhang (486 papers)
  5. Tathagata Bhattacharya (2 papers)
  6. Nilanjana Raychawdhary (2 papers)
  7. Mary Sandage (2 papers)
  8. Lauramarie Pope (2 papers)
  9. Gerry Dozier (7 papers)
  10. Cheryl Seals (3 papers)
  11. Fatemeh Jamshidi (4 papers)
  12. Vinija Jain (42 papers)
  13. Aman Chadha (109 papers)