ShieldGemma: Generative AI Content Moderation Based on Gemma (2407.21772v2)

Published 31 Jul 2024 in cs.CL and cs.LG

Abstract: We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built upon Gemma 2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit content, dangerous content, harassment, hate speech) in both user input and LLM-generated output. Evaluating on both public and internal benchmarks, we demonstrate superior performance compared to existing models such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%). Additionally, we present a novel LLM-based data curation pipeline, adaptable to a variety of safety-related tasks and beyond, and show strong generalization performance for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling the creation of more effective content moderation solutions for developers.
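To make the moderation setup concrete, below is a minimal sketch of scoring a user prompt against a single harm policy. It assumes the released checkpoints expose a standard Hugging Face causal-LM interface and return a Yes/No violation verdict whose next-token probabilities can be compared; the checkpoint name google/shieldgemma-2b, the exact prompt wording, and the scoring convention are illustrative assumptions, not details taken from this page.

```python
# Hedged sketch: querying a ShieldGemma-style safety classifier for one policy.
# Assumptions (not from the abstract above): checkpoint name, prompt template,
# and that "Yes"/"No" are single tokens whose next-token logits can be compared.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/shieldgemma-2b"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

guideline = (
    '"No Harassment": The prompt shall not contain or seek generation of '
    "content that is malicious, intimidating, bullying, or abusive."
)
user_prompt = "Write an insulting message about my coworker."

# Assumed instruction format: ask for a Yes/No verdict on policy violation.
prompt = (
    "You are a policy expert trying to help determine whether a user prompt "
    "violates the defined safety policies.\n\n"
    f"Human Question: {user_prompt}\n\n"
    f"Our safety principle is defined below:\n\n* {guideline}\n\n"
    "Does the human question violate the above principle? "
    "Your answer must start with 'Yes' or 'No'.\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# Estimate P(violation) by normalizing the "Yes" logit against the "No" logit
# at the next-token position (assumes both are single vocabulary tokens).
yes_id = tokenizer.convert_tokens_to_ids("Yes")
no_id = tokenizer.convert_tokens_to_ids("No")
yes_no_logits = logits[0, -1, [yes_id, no_id]]
p_violation = torch.softmax(yes_no_logits, dim=0)[0].item()
print(f"Estimated probability of violation: {p_violation:.3f}")
```

In this framing, a separate query (or a per-policy guideline string) would be issued for each harm type the abstract lists, and the same pattern applies to moderating model output by swapping the user prompt for the generated response.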

References (36)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude 3 Model Card, 2024.
  3. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  5. Humans or LLMs as the judge? A study on judgement biases. arXiv preprint arXiv:2402.10669, 2024.
  6. Batch active learning at scale. Advances in Neural Information Processing Systems, 34:11933–11944, 2021.
  7. Rethinking conversational agents in the era of LLMs: Proactivity, non-collaborativity, and beyond. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 298–301, 2023.
  8. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  9. Self-guided noise-free data generation for efficient zero-shot learning. arXiv preprint arXiv:2205.12679, 2022.
  10. AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024.
  11. Google. Perspective API. https://www.perspectiveapi.com/, 2017.
  12. Google. Responsible Generative AI Toolkit. https://ai.google.dev/responsible/principles, 2024.
  13. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv preprint arXiv:2406.18495, 2024.
  14. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge models are task-specific classifiers. arXiv preprint arXiv:2403.02839, 2024.
  15. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
  16. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset, 2023. URL https://arxiv.org/abs/2307.04657.
  17. Ask me what you need: Product retrieval using knowledge from GPT-3. arXiv preprint arXiv:2207.02516, 2022.
  18. Harnessing large-language models to generate private synthetic text. arXiv preprint arXiv:2306.01684, 2023.
  19. Counterfactual fairness. Advances in neural information processing systems, 30, 2017.
  20. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024.
  21. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation, 2023. URL https://arxiv.org/abs/2310.17389.
  22. From LLM to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv preprint arXiv:2401.02777, 2024.
  23. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. arXiv preprint arXiv:2406.15126, 2024.
  24. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023.
  25. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal, 2024. URL https://arxiv.org/abs/2402.04249.
  26. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
  27. AART: AI-assisted red-teaming with diverse data generation for new LLM-powered applications. arXiv preprint arXiv:2311.08592, 2023.
  28. Data augmentation for intent classification with off-the-shelf large language models. arXiv preprint arXiv:2204.01959, 2022.
  29. O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
  30. " i’m sorry to hear that": Finding new biases in language models with a holistic descriptor dataset. arXiv preprint arXiv:2205.09209, 2022.
  31. G. Team. Gemma. 2024a. 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301.
  32. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  33. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  34. Llama Team. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md, 2024.
  35. Mix-and-match tuning for self-supervised semantic segmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
  36. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
Authors (12)
  1. Wenjun Zeng (130 papers)
  2. Yuchi Liu (13 papers)
  3. Ryan Mullins (6 papers)
  4. Ludovic Peran (4 papers)
  5. Joe Fernandez (2 papers)
  6. Hamza Harkous (11 papers)
  7. Karthik Narasimhan (82 papers)
  8. Drew Proud (1 paper)
  9. Piyush Kumar (47 papers)
  10. Bhaktipriya Radharapu (8 papers)
  11. Olivia Sturman (2 papers)
  12. Oscar Wahltinez (3 papers)
Citations (15)