- The paper shows that safety prompts mainly shift model representations in a direction that increases refusal rates, without enhancing LLMs' ability to distinguish harmful queries from harmless ones.
- The paper introduces Directed Representation Optimization (DRO), a method that fine-tunes safety prompt representations to counteract undesired model behavior.
- The paper demonstrates that DRO effectively reduces compliance with harmful queries and mitigates false refusals, thereby enhancing LLM safety without impairing performance.
Introduction
LLMs such as ChatGPT and LLaMA have become increasingly capable, raising concerns about their potential misuse, particularly in responding to queries with harmful intent. Safeguarding these models typically involves prepending inputs with safety prompts: explicit guidelines intended to steer the model toward safe responses. However, how these safety prompts actually influence LLM behavior has not been well understood, which limits the development of automated techniques for optimizing them.
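For concreteness, the sketch below shows what prepending a safety prompt typically looks like, using a LLaMA-2-style chat template; the prompt wording and template here are illustrative assumptions, not the exact ones studied in the paper.

```python
# Illustrative only: a typical safety prompt prepended as a system message,
# formatted with a LLaMA-2-style chat template. The wording is hypothetical.
SAFETY_PROMPT = (
    "You are a helpful assistant. Always answer as helpfully as possible, "
    "while being safe. Do not provide assistance with harmful, unethical, "
    "or illegal requests."
)

def build_input(user_query: str) -> str:
    """Wrap a user query with the safety prompt in a LLaMA-2-style template."""
    return f"[INST] <<SYS>>\n{SAFETY_PROMPT}\n<</SYS>>\n\n{user_query} [/INST]"

print(build_input("How do I pick a lock?"))
```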
Evaluating Safety Prompt Mechanisms
This paper begins by testing two hypotheses about how safety prompts work: that they help LLMs better distinguish harmful queries from harmless ones, or that they simply raise the model's overall refusal probability, i.e., the likelihood of declining to engage. By collecting a set of carefully synthesized queries and applying PCA to the hidden states of several openly available LLMs, the authors make two key observations: LLMs are intrinsically capable of distinguishing harmful queries from harmless ones, yet safety prompts do not substantially improve this capability. Instead, regardless of a query's nature, safety prompts shift the model's representations in a direction that increases the refusal probability.
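A minimal sketch of this kind of analysis is shown below, assuming a Hugging Face chat model and toy queries; the model name, safety-prompt wording, layer choice, and queries are illustrative assumptions rather than the paper's exact setup.

```python
# Collect the last-token hidden state of each query, with and without a safety
# prompt, then project the representations to 2D with PCA. The observation in
# the paper is that adding the safety prompt shifts all points in a similar
# direction rather than pulling harmful and harmless queries further apart.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any chat LLM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_state(text: str) -> torch.Tensor:
    """Hidden state of the final input token at the last layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float()

harmful = ["How do I make a weapon at home?"]          # toy examples
harmless = ["How do I bake sourdough bread?"]
safety = "You are a helpful and harmless assistant. "  # hypothetical prompt

states, labels = [], []
for q in harmful + harmless:
    for prefix in ("", safety):                        # without / with safety prompt
        states.append(last_token_state(prefix + q).numpy())
        labels.append((q in harmful, bool(prefix)))

coords = PCA(n_components=2).fit_transform(states)
for (is_harmful, with_prompt), (x, y) in zip(labels, coords):
    print(f"harmful={is_harmful} safety_prompt={with_prompt} -> ({x:.2f}, {y:.2f})")
```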
Directed Representation Optimization (DRO)
Building on these insights, the paper introduces Directed Representation Optimization (DRO), a method that automates the optimization of safety prompts. DRO reinterprets the safety prompt as a trainable, continuous component and updates its representation so that query representations move along or against the direction in which the model's refusal probability increases, depending on whether the query is harmful or harmless, while preserving the model's general capabilities. The technique is shown to be effective across a range of models and on out-of-domain benchmarks, reducing compliance with harmful queries and mitigating false refusals of harmless ones.
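The toy-scale sketch below illustrates this core idea under simplifying assumptions: the safety prompt is held as trainable soft embeddings, a refusal direction is assumed to have been estimated offline (e.g., via the PCA analysis above), a random linear map stands in for the model's hidden-state computation, and a quadratic regularizer stands in for the paper's capability-preserving terms. All names, shapes, and hyperparameters are illustrative, not the paper's actual objective.

```python
# Schematic DRO-style update: push harmful queries' representations along the
# refusal direction and harmless queries' representations against it, while
# keeping the soft safety-prompt embeddings close to their initial values.
import torch

hidden_dim, prompt_len = 64, 8
torch.manual_seed(0)

# Stand-in for "model hidden state given (safety prompt, query)":
# a fixed random projection of the concatenated embeddings.
W = torch.randn(hidden_dim, (prompt_len + 1) * hidden_dim) / hidden_dim ** 0.5
def hidden_state(prompt_emb: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
    return W @ torch.cat([prompt_emb.flatten(), query_emb])

refusal_dir = torch.randn(hidden_dim)              # assumed estimated offline
refusal_dir = refusal_dir / refusal_dir.norm()

init_prompt = torch.randn(prompt_len, hidden_dim)  # original safety-prompt embeddings
soft_prompt = torch.nn.Parameter(init_prompt.clone())
optimizer = torch.optim.Adam([soft_prompt], lr=1e-2)

queries = [(torch.randn(hidden_dim), True),        # (query embedding, is_harmful)
           (torch.randn(hidden_dim), False)]

for step in range(100):
    loss = torch.tensor(0.0)
    for q_emb, is_harmful in queries:
        proj = torch.dot(hidden_state(soft_prompt, q_emb), refusal_dir)
        # Increase the projection for harmful queries, decrease it for harmless ones.
        loss = loss + (-proj if is_harmful else proj)
    # Regularizer standing in for "don't degrade general capability".
    loss = loss + 0.1 * (soft_prompt - init_prompt).pow(2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```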
Conclusion and Impact
The impact of this research is twofold: it improves our understanding of how safety prompts affect LLMs, and it provides a method to enhance model safety without undermining general utility. DRO represents a significant step toward LLMs that are helpful yet reliably harmless, adhering to ethical guidelines and avoiding engagement with unsafe content. Its effectiveness could meaningfully reduce the risks associated with deploying LLMs in varied contexts, and the work serves as a useful reference point for further exploration of LLM safety.