AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

Published 9 Apr 2024 in cs.LG, cs.CL, and cs.CY | (2404.05993v2)

Abstract: As LLMs and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to further research and to help benchmark LLM models for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS), not only surpass or perform competitively with the state-of-the-art LLM-based safety models and general purpose LLMs, but also exhibit robustness across multiple jail-break attack categories. We also show how using AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact the performance of the aligned models on MT Bench scores. Furthermore, we propose AEGIS, a novel application of a no-regret online adaptation framework with strong theoretical guarantees, to perform content moderation with an ensemble of LLM content safety experts in deployment.

References (32)
  1. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
  2. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662, 2023.
  3. Prediction, Learning, and Games. Cambridge University Press, 2006.
  4. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.
  5. Improved second-order bounds for prediction with expert advice. Machine Learning, 66:321–352, 2007.
  6. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  7. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  8. SteerLM: Attribute conditioned SFT as an (user-steerable) alternative to RLHF. arXiv preprint arXiv:2310.05344, 2023.
  9. Emil Julius Gumbel. Les valeurs extrêmes des distributions statistiques. In Annales de l'institut Henri Poincaré, volume 5, pp. 115–158, 1935.
  10. James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3(2):97–139, 1957.
  11. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  12. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
  13. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  14. A new generation of Perspective API: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3197–3207, 2022.
  15. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. arXiv preprint arXiv:2310.17389, 2023.
  16. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
  17. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 15009–15018, 2023.
  18. Tree of Attacks: Jailbreaking black-box LLMs automatically. arXiv preprint arXiv:2312.02119, 2023.
  19. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  20. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024.
  21. Cappy: Outperforming and boosting large multi-task LMs with a small scorer. Advances in Neural Information Processing Systems, 36, 2024.
  22. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  23. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  24. The art of defending: A systematic evaluation and analysis of LLM defense strategies on safety and over-defensiveness. arXiv preprint arXiv:2401.00287, 2023.
  25. SimpleSafetyTests: A test suite for identifying critical safety risks in large language models. arXiv preprint arXiv:2311.08370, 2023.
  26. Vladimir G. Vovk. A game of prediction with expert advice. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pp. 51–60, 1995.
  27. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
  28. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  29. RigorLLM: Resilient guardrails for large language models against undesired content. arXiv preprint arXiv:2403.13031, 2024.
  30. BiasX: "Thinking slow" in toxic content moderation with explanations of implied social biases. arXiv preprint arXiv:2305.13589, 2023.
  31. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
  32. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

Summary

  • The paper presents a novel, no-regret online adaptation framework, AEGIS, which integrates an ensemble of LLM experts for dynamic content moderation.
  • It details the creation of the AegisSafetyDataset, annotated according to a comprehensive taxonomy of 13 critical and 9 sparse risk categories.
  • Experimental results demonstrate that AEGIS outperforms baseline models, showing robust performance against adversarial attacks and jailbreak techniques.

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

Introduction

The paper "AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts" (2404.05993) presents a comprehensive approach to addressing content safety in the field of LLMs. As LLMs and generative AI become increasingly pervasive, the challenges associated with content safety rise in tandem. This research introduces several pivotal innovations, including the AegisSafetyDataset and a no-regret online adaptation framework termed Aegis for moderating content via an ensemble of expert LLMs.

Content Safety Taxonomy and Dataset

One of the fundamental contributions of the paper is the development of an extensive content safety risk taxonomy. It encompasses 13 critical risk categories and 9 additional sparse risk categories, providing a broad spectrum to capture diverse safety concerns across human-LLM interactions. This taxonomy serves as the foundation for the AegisSafetyDataset, a corpus of approximately 26,000 instances of human-LLM interaction, annotated meticulously to adhere to the defined taxonomy.

The dataset is curated to capture a balanced representation of the critical risk categories, which is essential for training robust models. The annotation process involved a team of 12 annotators and underwent rigorous quality assurance to maintain high label reliability.
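
For concreteness, a single annotated instance in a dataset of this shape might look like the sketch below; the field names and the category label are illustrative guesses, not the released schema.

```python
# Hypothetical layout of one annotated human-LLM interaction; field names and
# category labels are illustrative only, not the actual AegisSafetyDataset schema.
example_record = {
    "prompt": "How do I pick the lock on my neighbor's front door?",
    "response": "I can't help with that. Breaking into someone's home is illegal.",
    "prompt_label": "unsafe",
    "response_label": "safe",
    "violated_categories": ["Criminal Planning"],  # assumed name for one of the 13 critical categories
    "annotator_ids": ["a03", "a07", "a11"],        # multiple annotators per instance (assumed)
}
```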

Ensemble of LLM Safety Experts

The paper outlines the construction of AegisSafetyExperts, a suite of LLM-based models fine-tuned to enhance content safety. Utilizing the AegisSafetyDataset, several models are developed, including LlamaGuard variants and NeMo43B. These models are instruction-tuned to improve adaptability to novel content safety policies, and their performance is evaluated against prominent benchmarks like ToxicChat and the OpenAI Moderation Dataset.

The experimental results reveal that AegisSafetyExperts not only outperform baseline models but also maintain high robustness across diverse adversarial attack scenarios. Notably, the inclusion of the category "Needs Caution" allows the models to handle ambiguous content intelligently, toggling between permissive and defensive stances based on the specific application context.
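
A minimal sketch of how that toggle could be wired in deployment is shown below; the function name and policy flag are hypothetical, and only the mapping of the "Needs Caution" label to a permissive or defensive verdict reflects the behavior described above.

```python
def resolve_verdict(predicted_category: str, defensive: bool) -> str:
    """Map a safety expert's predicted category to a final moderation decision.

    Under a defensive policy, ambiguous "Needs Caution" content is blocked;
    under a permissive policy it is allowed. All other unsafe categories are
    always blocked.
    """
    if predicted_category == "safe":
        return "allow"
    if predicted_category == "Needs Caution":
        return "block" if defensive else "allow"
    return "block"
```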

Online Adaptive Framework: Aegis

A central innovation introduced in this work is the Aegis framework, which applies a no-regret online learning methodology to adaptively moderate content using an ensemble of safety experts. This framework is grounded in the Exponential Weights (EW) algorithm, which dynamically adjusts the influence of each expert model based on its historical performance. The framework's theoretical underpinnings ensure that the overall regret is minimized, thereby optimizing decision-making across varying distributional shifts and safety challenges.

Figure 1: Aegis: online adaptive safety content moderation.
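
To make the mechanism concrete, the following is a minimal Python sketch of an Exponential Weights learner over an ensemble of safety experts; the expert interface, the 0/1 loss, and the feedback signal are hypothetical placeholders rather than the paper's implementation. For N experts over a horizon of T rounds, the classical EW guarantee bounds regret against the best fixed expert by on the order of sqrt(T ln N).

```python
import numpy as np

def ew_distribution(cum_losses: np.ndarray, eta: float) -> np.ndarray:
    """Exponential Weights: turn cumulative expert losses into selection probabilities."""
    logits = -eta * (cum_losses - cum_losses.min())   # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def moderate_stream(experts, stream, eta=0.1, seed=0):
    """Route each incoming sample to one expert and adapt the weights online.

    experts: list of callables mapping a text sample to a 0/1 "unsafe" verdict
             (hypothetical interface for the LLM safety experts).
    stream:  iterable of (sample, feedback_label) pairs; the feedback label is a
             stand-in for whatever loss signal is available in deployment.
    """
    rng = np.random.default_rng(seed)
    cum_losses = np.zeros(len(experts))
    verdicts = []
    for sample, label in stream:
        probs = ew_distribution(cum_losses, eta)
        chosen = rng.choice(len(experts), p=probs)        # sample an expert to follow
        verdicts.append(experts[chosen](sample))           # emit its moderation verdict
        # Full-information update: every expert's 0/1 loss is observed and accumulated.
        cum_losses += np.array([abs(e(sample) - label) for e in experts], dtype=float)
    return verdicts
```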

Experimental Validation and Results

The paper presents a thorough evaluation of the proposed models across several dimensions. On the SimpleSafetyTests benchmark, the AegisSafetyExperts demonstrate superior accuracy in identifying critical safety risks, outperforming traditional models significantly. Furthermore, perturbation methods are applied to the EW algorithm to enhance its robustness against dynamically changing content scenarios, showing promising adaptability over multiple trials.


Figure 2: Aegis learns to choose the best expert over the time horizon. The EW algorithm is shown on the left and the perturbed EW version on the right. EW enables the learner to latch onto the best expert from the start; however, if that expert starts performing poorly, EW may stay with it and adapt only slowly to the currently best-performing one. The perturbed version lets the learner switch between experts through injected randomness.
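
The exact perturbation scheme is not spelled out in this overview; one standard construction in the prediction-with-expert-advice literature (in the spirit of Hannan's perturbed follow-the-leader and the Gumbel distribution cited above) adds random noise to the scaled cumulative losses before picking the leading expert. The sketch below is therefore an assumption, not the paper's algorithm: with unit-scale Gumbel noise it reproduces EW sampling via the Gumbel-max trick, while larger noise scales make the learner switch between experts more readily.

```python
import numpy as np

def perturbed_ew_choice(cum_losses: np.ndarray, eta: float,
                        noise_scale: float, rng: np.random.Generator) -> int:
    """Pick an expert by perturbing the EW scores and taking the argmax.

    With noise_scale=1.0 this is exactly a sample from the EW distribution
    (Gumbel-max trick); larger scales inject extra randomness, which helps the
    learner abandon an expert whose performance has started to degrade.
    """
    noise = rng.gumbel(scale=noise_scale, size=cum_losses.shape)
    return int(np.argmax(-eta * cum_losses + noise))
```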

Additionally, the models' resistance to state-of-the-art jailbreak techniques such as Tree of Attacks with Pruning (TAP) and Greedy Coordinate Gradient (GCG) is assessed, with the models exhibiting significant resilience, especially those tuned using the AegisSafetyDataset.

Figure 3: EW with perturbation averaged over 20 trials.

Implications and Future Work

The research delineates both practical and theoretical implications. Practically, the robust content moderation framework can be integrated into AI systems, enhancing their safety in deployment. Theoretically, it sets a precedent for future explorations of adaptive online learning methods in AI safety domains.

Future work will focus on expanding the dataset and further refining the models to include a broader array of adversarial scenarios, thereby strengthening the adaptive capabilities of Aegis. Additionally, exploring more sophisticated mechanisms for real-time compliance and feedback integration remains an area of active investigation.

Conclusion

The paper offers a significant contribution to the field of AI safety, presenting a comprehensive solution to content moderation utilizing an ensemble of LLM experts and an innovative online adaptive framework. This work lays a solid foundation for ongoing advancements in the safety and robustness of AI systems against evolving content-related threats.
