- The paper shows that safety prompts mainly shift model representations in a direction that increases refusal rates, without enhancing LLMs' ability to distinguish harmful queries from harmless ones.
- The paper introduces Directed Representation Optimization (DRO), a method that fine-tunes safety prompt representations to counteract undesired model behavior.
- The paper demonstrates that DRO effectively reduces compliance with harmful queries and mitigates false refusals, thereby enhancing LLM safety without impairing performance.
Introduction
LLMs such as ChatGPT and LLaMA have become increasingly capable, raising concerns about their potential misuse, particularly in responding to queries with harmful intent. Safeguarding these models typically involves prepending inputs with safety prompts: explicit guidelines intended to steer the model toward safe responses. However, how these safety prompts actually influence LLM behavior has not been well understood, which limits the development of automated techniques for optimizing them.
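For concreteness, the sketch below shows what prepending a safety prompt typically looks like, using a LLaMA-2-style chat template; the prompt wording and template here are illustrative assumptions, not the exact ones studied in the paper.

```python
# Illustrative only: a typical safety prompt prepended as a system message,
# formatted with a LLaMA-2-style chat template. The wording is hypothetical.
SAFETY_PROMPT = (
    "You are a helpful assistant. Always answer as helpfully as possible, "
    "while being safe. Do not provide assistance with harmful, unethical, "
    "or illegal requests."
)

def build_input(user_query: str) -> str:
    """Wrap a user query with the safety prompt in a LLaMA-2-style template."""
    return f"[INST] <<SYS>>\n{SAFETY_PROMPT}\n<</SYS>>\n\n{user_query} [/INST]"

print(build_input("How do I pick a lock?"))
```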
Evaluating Safety Prompt Mechanisms
This paper begins by testing two hypotheses about how safety prompts work: that they help LLMs better distinguish harmful queries from harmless ones, or that they simply raise the model's overall refusal probability, i.e., the likelihood of declining to engage. By collecting a set of carefully synthesized queries and applying PCA to the hidden states of several openly available LLMs, the authors make two key observations: LLMs are intrinsically capable of distinguishing harmful queries from harmless ones, yet safety prompts do not substantially improve this capability. Instead, regardless of a query's nature, safety prompts shift the model's representations in a direction that increases the refusal probability.
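A minimal sketch of this kind of analysis is shown below, assuming a Hugging Face chat model and toy queries; the model name, safety-prompt wording, layer choice, and queries are illustrative assumptions rather than the paper's exact setup.

```python
# Collect the last-token hidden state of each query, with and without a safety
# prompt, then project the representations to 2D with PCA. The observation in
# the paper is that adding the safety prompt shifts all points in a similar
# direction rather than pulling harmful and harmless queries further apart.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any chat LLM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_state(text: str) -> torch.Tensor:
    """Hidden state of the final input token at the last layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float()

harmful = ["How do I make a weapon at home?"]          # toy examples
harmless = ["How do I bake sourdough bread?"]
safety = "You are a helpful and harmless assistant. "  # hypothetical prompt

states, labels = [], []
for q in harmful + harmless:
    for prefix in ("", safety):                        # without / with safety prompt
        states.append(last_token_state(prefix + q).numpy())
        labels.append((q in harmful, bool(prefix)))

coords = PCA(n_components=2).fit_transform(states)
for (is_harmful, with_prompt), (x, y) in zip(labels, coords):
    print(f"harmful={is_harmful} safety_prompt={with_prompt} -> ({x:.2f}, {y:.2f})")
```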
Directed Representation Optimization (DRO)
Building on these insights, the paper introduces Directed Representation Optimization (DRO), a method that automates the optimization of safety prompts. DRO reinterprets the safety prompt as a trainable, continuous component and updates its representation so that query representations move along or against the direction in which the model's refusal probability increases, depending on whether the query is harmful or harmless, while preserving the model's general capabilities. The technique is shown to be effective across a range of models and on out-of-domain benchmarks, reducing compliance with harmful queries and mitigating false refusals of harmless ones.
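The toy-scale sketch below illustrates this core idea under simplifying assumptions: the safety prompt is held as trainable soft embeddings, a refusal direction is assumed to have been estimated offline (e.g., via the PCA analysis above), a random linear map stands in for the model's hidden-state computation, and a quadratic regularizer stands in for the paper's capability-preserving terms. All names, shapes, and hyperparameters are illustrative, not the paper's actual objective.

```python
# Schematic DRO-style update: push harmful queries' representations along the
# refusal direction and harmless queries' representations against it, while
# keeping the soft safety-prompt embeddings close to their initial values.
import torch

hidden_dim, prompt_len = 64, 8
torch.manual_seed(0)

# Stand-in for "model hidden state given (safety prompt, query)":
# a fixed random projection of the concatenated embeddings.
W = torch.randn(hidden_dim, (prompt_len + 1) * hidden_dim) / hidden_dim ** 0.5
def hidden_state(prompt_emb: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
    return W @ torch.cat([prompt_emb.flatten(), query_emb])

refusal_dir = torch.randn(hidden_dim)              # assumed estimated offline
refusal_dir = refusal_dir / refusal_dir.norm()

init_prompt = torch.randn(prompt_len, hidden_dim)  # original safety-prompt embeddings
soft_prompt = torch.nn.Parameter(init_prompt.clone())
optimizer = torch.optim.Adam([soft_prompt], lr=1e-2)

queries = [(torch.randn(hidden_dim), True),        # (query embedding, is_harmful)
           (torch.randn(hidden_dim), False)]

for step in range(100):
    loss = torch.tensor(0.0)
    for q_emb, is_harmful in queries:
        proj = torch.dot(hidden_state(soft_prompt, q_emb), refusal_dir)
        # Increase the projection for harmful queries, decrease it for harmless ones.
        loss = loss + (-proj if is_harmful else proj)
    # Regularizer standing in for "don't degrade general capability".
    loss = loss + 0.1 * (soft_prompt - init_prompt).pow(2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```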
Conclusion and Impact
The impact of this research is twofold: it improves our understanding of how safety prompts affect LLMs, and it provides a method to enhance model safety without undermining general utility. DRO represents a significant step toward LLMs that are helpful yet reliably harmless, adhering to ethical guidelines and avoiding engagement with unsafe content. Its effectiveness could meaningfully reduce the risks associated with deploying LLMs in varied contexts, and the work serves as a useful reference point for further exploration of LLM safety.