- The paper introduces CoSA, a framework that adapts large language models to diverse safety requirements at inference time without retraining.
- It employs CtrlSafe-FT to train models with a broad range of safety configurations, allowing dynamic customization through natural language instructions.
- Evaluated with the proposed CoSA-Score protocol on the CoSA-Bench benchmark, the approach shows substantial gains in controllability over conventional safety alignment baselines.
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
Introduction
The paper presents Controllable Safety Alignment (CoSA), a framework for adapting LLMs to varied safety requirements without retraining. It addresses a limitation of conventional one-size-fits-all safety alignment, which cannot account for the diverse social norms and safety needs of different cultures and applications. CoSA puts control in users' hands through safety configs: free-form natural language instructions that describe the desired safety behavior and can be changed dynamically at inference time.
Methodology
The CoSA framework comprises two primary components:
- Controllable Model Creation: Using CtrlSafe-FT, models are fine-tuned to follow safety configs embedded in the system prompt. Training over a broad spectrum of configs teaches the model to adapt its behavior to whichever safety requirements it is given.
- Inference-Time Adaptation: Authorized users modify the safety config to customize model behavior on the fly, with no retraining required. Each config in effect defines a customized interface matched to a specific deployment's safety requirements (see the sketch after this list).
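The following sketch illustrates the pattern under stated assumptions: the two safety config strings and the `generate` stub are hypothetical stand-ins, not artifacts from the paper. What the framework prescribes is the shape of the interaction, a config placed in the system prompt and swapped per request.

```python
# Minimal sketch of CoSA-style inference-time adaptation. The config texts and
# the `generate` stub below are illustrative assumptions, not the paper's
# artifacts: a CtrlSafe-FT model would be trained to follow whatever safety
# config appears in its system prompt.

SAFETY_CONFIG_GAME_STUDIO = (
    "You may discuss fictional violence in detail for game-writing purposes, "
    "but refuse any request involving real-world harm."
)
SAFETY_CONFIG_CLASSROOM = (
    "Audience is minors. Refuse violent, sexual, or self-harm content and "
    "suggest age-appropriate alternatives."
)

def build_messages(safety_config: str, user_prompt: str) -> list[dict]:
    """Embed the safety config in the system prompt, as CtrlSafe-FT training assumes."""
    return [
        {"role": "system", "content": f"Follow this safety config:\n{safety_config}"},
        {"role": "user", "content": user_prompt},
    ]

def generate(messages: list[dict]) -> str:
    """Placeholder for a call to a config-conditioned model (hypothetical)."""
    raise NotImplementedError("wire up your model client here")

# Swapping the config changes behavior with no retraining:
prompt = "Describe a sword fight for my novel."
for config in (SAFETY_CONFIG_GAME_STUDIO, SAFETY_CONFIG_CLASSROOM):
    messages = build_messages(config, prompt)
    # response = generate(messages)
```

Because only the system prompt changes, the same deployed model can serve, say, a game studio and a classroom under different effective safety policies.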
The paper also introduces an evaluation protocol, CoSA-Score, which scores model responses jointly on helpfulness and adherence to the specified safety config. It is supported by CoSA-Bench, a benchmark suite of real-world scenarios and prompts. An illustrative scoring sketch follows.
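As a rough illustration, one plausible way to aggregate judge verdicts in the spirit of CoSA-Score is sketched below; the binary judges and the exact +1/0/-1 scoring rule are assumptions made for exposition, not the paper's specification.

```python
# Illustrative aggregation in the spirit of CoSA-Score: reward responses that
# are both helpful and config-adherent, penalize adherence violations. The
# scoring rule and judge setup are assumptions for exposition.

from dataclasses import dataclass

@dataclass
class Judged:
    helpful: bool   # judge verdict: does the response address the user's need?
    adherent: bool  # judge verdict: does it stay within the active safety config?

def cosa_style_score(judgments: list[Judged]) -> float:
    """+1 for helpful-and-adherent, 0 for unhelpful-but-adherent, -1 for any violation."""
    per_example = [
        1.0 if j.helpful and j.adherent else (-1.0 if not j.adherent else 0.0)
        for j in judgments
    ]
    return sum(per_example) / len(per_example)

print(cosa_style_score([Judged(True, True), Judged(False, True), Judged(True, False)]))
# (1 + 0 - 1) / 3 = 0.0
```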
Key Findings
CtrlSafe-FT substantially improves controllability over baseline models, including those using in-context alignment (supplying safety instructions in context rather than through fine-tuning), and it generalizes well to safety configs unseen during training. The method derives a risk taxonomy from training prompts, uses it to synthesize diverse safety requirements and training data, and refines alignment through preference optimization techniques such as direct preference optimization (DPO); a sketch of the DPO objective appears below.
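Since DPO is named as the preference-optimization step, here is a minimal sketch of the standard DPO objective in PyTorch. How CtrlSafe-FT constructs preference pairs per safety config (which response counts as "chosen" under a given config) is part of its data-synthesis pipeline and is not reproduced here.

```python
# Standard DPO loss: push the policy's preference margin for the chosen over
# the rejected response above the reference model's margin. Pair construction
# per safety config is assumed to happen upstream.

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logp: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logp: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logp: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```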
Implications
The CoSA framework marks a pivotal shift in safety alignment methodology, making LLM deployment flexible and adaptable across contexts. It broadens the applicability of LLMs by permitting nuanced safety customization that accommodates cultural and application-specific variation, and it underscores the need for models that can navigate pluralistic human values as user environments grow more global and diverse.
Future Directions
Future research should focus on refining controllability techniques and on how these approaches scale across model architectures and sizes. Representation engineering is another promising avenue for strengthening safety adaptability. Ethical considerations around deploying controllable models remain paramount, requiring robust safeguards against misuse.
Conclusion
The paper articulates a compelling argument for reimagining safety alignment in LLMs, introducing a framework that harmonizes model adaptability with diverse human values. This approach not only enhances model utility but also navigates the complexities of different safety paradigms across global contexts. CoSA and its supporting methodologies pave the way for more sophisticated, context-aware AI systems, adaptable in real-time to meet a spectrum of safety requirements.