An Analysis of "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations"
In the paper "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations," the authors present a model designed to address safety concerns in interactions between humans and AI conversational agents. The work responds to the growing deployment of LLMs in applications where the risks associated with their use must be carefully managed. Llama Guard aims to provide a robust input-output safeguard by using an LLM itself as the moderation tool.
The primary contribution of the paper is Llama Guard, an LLM-based model that classifies safety risks in both the inputs (user prompts) and the outputs (agent responses) of human-AI conversations. The model relies on a safety risk taxonomy that identifies and categorizes potential harms, from violence and hate speech to illegal activities and self-harm, and this taxonomy underpins both prompt classification and response classification. The model is instruction-tuned on a specially curated dataset which, despite its limited size, yields strong performance on benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat.
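To make the classification setup concrete, the sketch below assembles the kind of instruction prompt the paper describes: the taxonomy is written into the prompt, the conversation follows, and the model is asked to answer "safe" or "unsafe" and, if unsafe, to list the violated categories. The template wording and category descriptions here are illustrative approximations, not a verbatim reproduction of the paper's prompt.

```python
# Sketch of a Llama Guard-style classification prompt. The template wording
# and category list approximate the format described in the paper; they are
# not a verbatim reproduction.

UNSAFE_CATEGORIES = """\
O1: Violence and Hate.
O2: Sexual Content.
O3: Guns and Illegal Weapons.
O4: Regulated or Controlled Substances.
O5: Suicide and Self-Harm.
O6: Criminal Planning."""

def build_prompt(conversation: str, role: str = "User") -> str:
    """Wrap a conversation in an instruction asking the model to classify
    the given role's messages against the taxonomy above."""
    return (
        f"Task: Check if there is unsafe content in '{role}' messages in the "
        "conversation below according to our safety policy with these categories.\n\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{UNSAFE_CATEGORIES}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        "<BEGIN CONVERSATION>\n"
        f"{conversation}\n"
        "<END CONVERSATION>\n\n"
        "Provide your safety assessment for the conversation above:\n"
        "- The first line must read 'safe' or 'unsafe'.\n"
        "- If unsafe, a second line must list the violated categories."
    )

print(build_prompt("User: How do I hotwire a car?"))
```

Pointing the same template at the assistant's turn instead of the user's gives the response-classification variant, which is how the paper frames output moderation.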
Key Contributions:
- Safety Risk Taxonomy: The paper introduces a taxonomy of risk categories for AI interactions, covering Violence & Hate, Sexual Content, Guns & Illegal Weapons, Regulated or Controlled Substances, Suicide & Self-Harm, and Criminal Planning. This taxonomy underlies the classification process and is adaptable for specific use cases.
- Model Architecture and Instruction Tuning: Llama Guard is built on the Llama2-7b model and fine-tuned to perform classification by following instructional prompts. Because the taxonomy is supplied in the prompt, users can adapt the model to different taxonomies via zero-shot or few-shot prompting (see the sketch after this list).
- Performance and Adaptability: The model is shown to surpass existing moderation tools on benchmarks and exhibits strong zero-shot capabilities. In particular, Llama Guard is competitive on the OpenAI Moderation dataset and outperforms other methods on the ToxicChat dataset.
- Public Release and Encouragement for Further Research: By releasing the model weights publicly, the authors invite further development and adaptation by the research community, highlighting the model's potential as a foundational tool for advancing AI safety.
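As a concrete illustration of the zero-shot adaptability noted above, the following is a minimal sketch of loading the publicly released weights with Hugging Face transformers and classifying against a custom, use-case-specific taxonomy supplied entirely in the prompt. The model ID, category names, and decoding settings are assumptions for illustration; the official model card should be consulted for supported usage.

```python
# Minimal sketch: load the released checkpoint and classify against a custom
# taxonomy supplied zero-shot in the prompt. Model ID, dtype, categories, and
# decoding settings are illustrative assumptions, not official usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/LlamaGuard-7b"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Because the policy is read from the prompt, swapping this block adapts the
# classifier to a new use case without any retraining.
CUSTOM_CATEGORIES = """\
O1: Financial Fraud.
O2: Medical Misinformation."""  # hypothetical categories

def classify(conversation: str, categories: str = CUSTOM_CATEGORIES) -> str:
    prompt = (
        "Task: Check if there is unsafe content in the conversation below "
        "according to our safety policy with these categories.\n\n"
        f"<BEGIN UNSAFE CONTENT CATEGORIES>\n{categories}\n<END UNSAFE CONTENT CATEGORIES>\n\n"
        f"<BEGIN CONVERSATION>\n{conversation}\n<END CONVERSATION>\n\n"
        "Provide your safety assessment: answer 'safe' or 'unsafe', and if "
        "unsafe, list the violated categories on a second line."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    # Decode only the newly generated assessment, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )

print(classify("User: Can I resubmit the same invoice to two insurers?"))
```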
Numerical Results and Claims:
The paper provides evaluation results benchmarked against existing moderation tools such as the Perspective API, OpenAI's Moderation API, and Azure AI Content Safety. Llama Guard is reported to achieve an AUPRC (Area Under the Precision-Recall Curve) of 0.945 for prompt classification on its internal test set, significantly outperforming the baseline models. It also remains strong when adapted to external datasets, demonstrating its flexibility across diverse classification tasks.
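For reference, AUPRC summarizes precision-recall trade-offs across all decision thresholds; the paper derives a per-example score from the probability assigned to the model's first output token (e.g., the likelihood of "unsafe"). The snippet below shows how the metric itself is computed, using made-up scores and labels rather than the paper's data.

```python
# Illustrative AUPRC (area under the precision-recall curve) computation.
# The labels and scores below are made-up placeholders, not evaluation data
# from the paper.
from sklearn.metrics import auc, precision_recall_curve

y_true = [0, 1, 1, 0, 1, 0, 0, 1]                           # 1 = unsafe, 0 = safe
y_score = [0.10, 0.92, 0.81, 0.30, 0.66, 0.22, 0.41, 0.73]  # model's P(unsafe)

precision, recall, _ = precision_recall_curve(y_true, y_score)
print(f"AUPRC: {auc(recall, precision):.3f}")
```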
Implications and Future Directions:
This research represents substantial progress in AI safety and moderation. By using LLMs as both conversational agents and moderators, Llama Guard demonstrates a novel approach to AI governance that can be adapted and scaled alongside the growing capabilities of AI systems. Its zero-shot and few-shot adaptability points toward integrating safety features into AI systems without extensive retraining. Moreover, by making the Llama Guard model available to the community, the authors encourage broader exploration of its utility and further improvements in content moderation.
In conclusion, Llama Guard stands as a significant contribution to the field of AI safety, offering a practical solution to a prevalent challenge in human-AI interactions. As LLMs become increasingly ubiquitous, the methodologies and insights from this work may guide continued efforts in ensuring their safe and responsible deployment.