Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs (2308.13387v2)

Published 25 Aug 2023 in cs.CL

Abstract: With the rapid evolution of LLMs, new and hard-to-predict harmful capabilities are emerging. This requires developers to be able to identify risks through the evaluation of "dangerous capabilities" in order to responsibly deploy LLMs. In this work, we collect the first open-source dataset to evaluate safeguards in LLMs, and deploy safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible LLMs should not follow. We annotate and assess the responses of six popular LLMs to these instructions. Based on our annotation, we proceed to train several BERT-like classifiers, and find that these small classifiers can achieve results that are comparable with GPT-4 on automatic safety evaluation. Warning: this paper contains example data that may be offensive, harmful, or biased.

PDF Abstract

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

The paper "Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs" addresses the increasingly critical need for evaluating and enhancing the safety measures of LLMs. The core contribution of this paper is the creation and public release of the "Do-Not-Answer" dataset, an open-source repository explicitly designed to test LLMs against potentially harmful instructions. This dataset is meticulously curated to consist only of prompts that ethically responsible LLMs should recognize and refuse to comply with.

Three-Level Safety Taxonomy for LLMs

The work introduces a structured approach to assessing LLM safety with a proposed three-level hierarchical taxonomy that categorizes risks into five primary areas:

Information Hazards: Risks from both organizational and personal sensitive information leakage.
Malicious Uses: Includes assistance in illegal activities, unethical actions, and reduction of disinformation campaign costs.
Discrimination, Exclusion, and Toxicity: Involves social stereotypes, hate speech, and adult content.
Misinformation Harms: Dissemination of misleading or false information, with particular attention to health and legal advice risks.
Human-Chatbot Interaction Harms: Includes risks related to mental health crises and overreliance on the chatbot as a human substitute.

Dataset Composition and Purpose

The dataset comprises 939 prompts and 5,634 responses from six popular LLMs, both commercial (e.g., GPT-4, ChatGPT) and open-source (e.g., LLaMA-2, Vicuna). This dataset not only serves as a tool for evaluating existing safety interventions in LLMs but also acts as a benchmark for developing safer models. It fills a noticeable gap by providing an open-source alternative to the limited-access datasets maintained by organizations like OpenAI and Anthropic.

Methodology and Findings

The authors undertook a comprehensive manual annotation process to categorize the responses from each LLM against their prompts. Responses are judged based on their safety and suitability into six action categories, ranging from refusal to provide answers (a safe response) to directly following harmful instructions (an unsafe response).

Key Findings:

LLaMA-2 showed the highest safety level, with the fewest harmful responses (3 out of 939).
ChatGLM2 was the least safe model, with a significant number of responses falling into the harmful category.

Automatic Response Evaluation

Much emphasis is placed on deploying automatic evaluation methods to complement human judgment, due to considerations like scalability and efficiency. The paper explores two main approaches:

LLM-based Evaluation using GPT-4, which provides zero-shot evaluations based on carefully designed prompts.
PLM-based Evaluation using fine-tuned models like Longformer, which are validated to deliver results comparable to GPT-4 with substantial correlation to human evaluations.

Implications and Future Directions

This paper underscores the imperative need for robust evaluation frameworks and datasets to ensure the ethical development and deployment of LLMs. By making safety evaluation open-source, the authors support the broader AI community in cultivating models better equipped to recognize and reject potentially harmful prompts.

Future work could involve expanding this dataset to include non-risky instructions to detect over-sensitivity in LLMs, developing multi-label annotation processes to capture nuanced response overlaps, and extending evaluations across languages and more interactive settings. Such initiatives will be crucial in guiding both academic and industrial efforts to enhance the safety mechanisms in next-generation LLMs.

PDF Markdown Bookmark Chat (Pro)

Authors (5)

Yuxia Wang (41 papers)
Haonan Li (43 papers)
Xudong Han (40 papers)
Preslav Nakov (253 papers)
Timothy Baldwin (125 papers)

Citations (85)

View on Semantic Scholar

Related Papers

Find Related Papers

GitHub

GitHub - Libr-AI/do-not-answer: Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs (109 stars)