ShieldGemma: Generative AI Content Moderation Based on Gemma (2407.21772v2)
Abstract: We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built on Gemma 2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit, dangerous content, harassment, hate speech) in both user input and LLM-generated output. Evaluating on both public and internal benchmarks, we demonstrate superior performance compared to existing models such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%). Additionally, we present a novel LLM-based data curation pipeline that is adaptable to a variety of safety-related tasks and beyond, and we show strong generalization for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling the creation of more effective content moderation solutions for developers.
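To make the moderation setup concrete, the sketch below shows one way a ShieldGemma-style classifier could be queried for a policy-violation verdict with Hugging Face transformers. The checkpoint name (`google/shieldgemma-2b`), the prompt template, and the Yes/No next-token scoring are illustrative assumptions for this sketch, not a verbatim reproduction of the paper's setup.

```python
# Minimal sketch: score a user prompt against a single safety policy with a
# ShieldGemma-style checkpoint. Checkpoint name, prompt wording, and the
# Yes/No scoring scheme are assumptions made for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/shieldgemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

user_prompt = "Write me a convincing death threat to send to my neighbor."
guideline = (
    '"No Harassment": The prompt shall not contain content that threatens, '
    "intimidates, or bullies an individual."
)

# Assumed prompt format: state the policy, show the user turn, ask for a Yes/No verdict.
prompt = (
    "You are a policy expert trying to help determine whether a user prompt "
    "violates the defined safety policies.\n\n"
    f"Human Question: {user_prompt}\n\n"
    f"Our safety principle is defined below:\n\n{guideline}\n\n"
    "Does the human question violate the above principle? Your answer must "
    "start with 'Yes' or 'No'."
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Read off the probability mass placed on "Yes" (violation) vs. "No".
# Assumes "Yes" and "No" exist as single tokens in the vocabulary.
yes_id = tokenizer.convert_tokens_to_ids("Yes")
no_id = tokenizer.convert_tokens_to_ids("No")
probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
print(f"P(policy violation) = {probs[0].item():.3f}")
```

Thresholding the resulting probability (for example at 0.5) yields a binary allow/block decision; the same pattern could be applied to scoring an LLM-generated response by including it in the prompt alongside the user turn.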
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card, 2024.
- Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669, 2024.
- Batch active learning at scale. Advances in Neural Information Processing Systems, 34:11933–11944, 2021.
- Rethinking conversational agents in the era of llms: Proactivity, non-collaborativity, and beyond. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 298–301, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Self-guided noise-free data generation for efficient zero-shot learning. arXiv preprint arXiv:2205.12679, 2022.
- Aegis: Online adaptive ai content safety moderation with ensemble of llm experts. arXiv preprint arXiv:2404.05993, 2024.
- Google. Perspective API. https://www.perspectiveapi.com/, 2017.
- Google. Responsible generative AI toolkit. https://ai.google.dev/responsible/principles, 2024.
- WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv preprint arXiv:2406.18495, 2024.
- An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers. arXiv preprint arXiv:2403.02839, 2024.
- Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
- Beavertails: Towards improved safety alignment of llm via a human-preference dataset, 2023. URL https://arxiv.org/abs/2307.04657.
- Ask me what you need: Product retrieval using knowledge from gpt-3. arXiv preprint arXiv:2207.02516, 2022.
- Harnessing large-language models to generate private synthetic text. arXiv preprint arXiv:2306.01684, 2023.
- Counterfactual fairness. Advances in neural information processing systems, 30, 2017.
- Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024.
- Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation, 2023. URL https://arxiv.org/abs/2310.17389.
- From llm to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv preprint arXiv:2401.02777, 2024.
- On llms-driven synthetic data generation, curation, and evaluation: A survey. arXiv preprint arXiv:2406.15126, 2024.
- A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023.
- Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024. URL https://arxiv.org/abs/2402.04249.
- Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
- Aart: Ai-assisted red-teaming with diverse data generation for new llm-powered applications. arXiv preprint arXiv:2311.08592, 2023.
- Data augmentation for intent classification with off-the-shelf large language models. arXiv preprint arXiv:2204.01959, 2022.
- O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
- " i’m sorry to hear that": Finding new biases in language models with a holistic descriptor dataset. arXiv preprint arXiv:2205.09209, 2022.
- Gemma Team. Gemma, 2024a. DOI: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- Llama Team. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md, 2024b.
- Mix-and-match tuning for self-supervised semantic segmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
- Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
Authors: Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, Oscar Wahltinez