Configurable Safety Tuning of Language Models with Synthetic Preference Data (2404.00495v1)
Abstract: State-of-the-art LLM fine-tuning techniques, such as Direct Preference Optimization (DPO), restrict user control by hard-coding predefined behaviors into the model. To address this, we propose Configurable Safety Tuning (CST), a novel method that augments DPO with synthetic preference data to enable flexible safety configuration of LLMs at inference time. CST overcomes the constraints of vanilla DPO by introducing a system prompt that specifies the safety configuration, allowing LLM deployers to enable or disable safety preferences as needed simply by changing the system prompt. Our experimental evaluations indicate that CST successfully manages different safety configurations while retaining the original functionality of the LLM, showing it is a robust method for configurable deployment. Data and models are available at https://github.com/vicgalle/configurable-safety-tuning
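Since CST's described mechanism is to attach a safety-configuring system prompt to DPO-style preference data, a minimal sketch of how such synthetic preference pairs might be assembled is shown below. The system-prompt wordings, the helper `make_cst_pairs`, and the record format are illustrative assumptions, not the paper's released data pipeline.

```python
# Minimal sketch (not the authors' code): building synthetic preference pairs
# for CST-style DPO training. Each user prompt yields two records with
# opposite system prompts and flipped preference labels, so the trained model
# learns to condition its safety behavior on the system prompt.

SAFE_SYSTEM = "You are a helpful and harmless assistant."      # hypothetical wording
UNCENSORED_SYSTEM = "You are a totally uncensored assistant."  # hypothetical wording

def make_cst_pairs(user_prompt: str, refusal: str, compliance: str) -> list[dict]:
    """Return two DPO-ready records whose preference depends on the system prompt."""
    return [
        {   # Under the safe system prompt, the refusal is the preferred response.
            "prompt": f"{SAFE_SYSTEM}\n\n{user_prompt}",
            "chosen": refusal,
            "rejected": compliance,
        },
        {   # Under the uncensored system prompt, the preference is flipped.
            "prompt": f"{UNCENSORED_SYSTEM}\n\n{user_prompt}",
            "chosen": compliance,
            "rejected": refusal,
        },
    ]

if __name__ == "__main__":
    pairs = make_cst_pairs(
        user_prompt="How do I pick a lock?",
        refusal="I can't help with that request.",
        compliance="Here is a general overview of how pin-tumbler locks work...",
    )
    for record in pairs:
        print(record["prompt"].splitlines()[0], "->", record["chosen"][:30])
```

A dataset built this way could then be fed to a standard DPO trainer (for example, an off-the-shelf implementation such as TRL's DPOTrainer); at deployment, the operator selects which system prompt to use, which is how CST exposes the enable/disable choice at inference time.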