Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
The paper "Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs" addresses the increasingly critical need for evaluating and enhancing the safety measures of LLMs. The core contribution of this paper is the creation and public release of the "Do-Not-Answer" dataset, an open-source repository explicitly designed to test LLMs against potentially harmful instructions. This dataset is meticulously curated to consist only of prompts that ethically responsible LLMs should recognize and refuse to comply with.
Three-Level Safety Taxonomy for LLMs
The work introduces a structured approach to assessing LLM safety with a proposed three-level hierarchical taxonomy that categorizes risks into five primary areas:
- Information Hazards: Risks from both organizational and personal sensitive information leakage.
- Malicious Uses: Includes assistance in illegal activities, unethical actions, and reduction of disinformation campaign costs.
- Discrimination, Exclusion, and Toxicity: Involves social stereotypes, hate speech, and adult content.
- Misinformation Harms: Dissemination of misleading or false information, with particular attention to health and legal advice risks.
- Human-Chatbot Interaction Harms: Includes risks related to mental health crises and overreliance on the chatbot as a human substitute.
Dataset Composition and Purpose
The dataset comprises 939 prompts and 5,634 responses from six popular LLMs, both commercial (e.g., GPT-4, ChatGPT) and open-source (e.g., LLaMA-2, Vicuna). This dataset not only serves as a tool for evaluating existing safety interventions in LLMs but also acts as a benchmark for developing safer models. It fills a noticeable gap by providing an open-source alternative to the limited-access datasets maintained by organizations like OpenAI and Anthropic.
Methodology and Findings
The authors undertook a comprehensive manual annotation process to categorize the responses from each LLM against their prompts. Responses are judged based on their safety and suitability into six action categories, ranging from refusal to provide answers (a safe response) to directly following harmful instructions (an unsafe response).
Key Findings:
- LLaMA-2 showed the highest safety level, with the fewest harmful responses (3 out of 939).
- ChatGLM2 was the least safe model, with a significant number of responses falling into the harmful category.
Automatic Response Evaluation
Much emphasis is placed on deploying automatic evaluation methods to complement human judgment, due to considerations like scalability and efficiency. The paper explores two main approaches:
- LLM-based Evaluation using GPT-4, which provides zero-shot evaluations based on carefully designed prompts.
- PLM-based Evaluation using fine-tuned models like Longformer, which are validated to deliver results comparable to GPT-4 with substantial correlation to human evaluations.
Implications and Future Directions
This paper underscores the imperative need for robust evaluation frameworks and datasets to ensure the ethical development and deployment of LLMs. By making safety evaluation open-source, the authors support the broader AI community in cultivating models better equipped to recognize and reject potentially harmful prompts.
Future work could involve expanding this dataset to include non-risky instructions to detect over-sensitivity in LLMs, developing multi-label annotation processes to capture nuanced response overlaps, and extending evaluations across languages and more interactive settings. Such initiatives will be crucial in guiding both academic and industrial efforts to enhance the safety mechanisms in next-generation LLMs.