Watch Your Language: Investigating Content Moderation with Large Language Models (2309.14517v2)
Abstract: LLMs have surged in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has received recent enthusiasm; however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of commodity LLMs on two common content moderation tasks: rule-based community moderation and toxic content detection. For rule-based community moderation, we instantiate 95 subcommunity-specific LLMs by prompting GPT-3.5 with the rules of 95 Reddit subcommunities. We find that GPT-3.5 is effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. For toxicity detection, we evaluate a suite of commodity LLMs (GPT-3, GPT-3.5, GPT-4, Gemini Pro, Llama 2) and show that LLMs significantly outperform currently widespread toxicity classifiers. However, recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for LLMs on this task. We conclude by outlining avenues for future work on LLMs and content moderation.
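The rule-based moderation setup described above amounts to embedding a community's rules in a prompt and asking the model whether a given comment violates them. A minimal sketch of that prompt-construction step follows; the prompt wording and the `moderation_prompt` helper are illustrative assumptions, not the paper's exact prompt.

```python
def moderation_prompt(community: str, rules: list[str], comment: str) -> str:
    """Build a rule-based moderation prompt for one subcommunity.

    Illustrative only: the paper prompts GPT-3.5 with each subreddit's rules,
    but this exact wording is an assumption, not the authors' prompt.
    """
    # Number the community rules so the model can reference them.
    rule_text = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        f"You are a moderator for the r/{community} community, "
        f"which has the following rules:\n{rule_text}\n\n"
        f"Comment: {comment}\n\n"
        "Does this comment violate any of the rules above? Answer YES or NO."
    )

# Example instantiation for one subcommunity (hypothetical rules).
prompt = moderation_prompt(
    "AskHistorians",
    ["Answers must be in-depth and comprehensive.",
     "No bigotry or offensive language."],
    "lol just google it",
)
print(prompt)
```

The resulting string would then be sent to a chat-completion endpoint (e.g., GPT-3.5) and the YES/NO answer compared against moderator decisions; repeating this with each community's rule set yields one "subcommunity-specific" moderator per subreddit.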
- Deepak Kumar
- Yousef AbuHashem
- Zakir Durumeric