
Decoupled Alignment for Robust Plug-and-Play Adaptation (2406.01514v3)

Published 3 Jun 2024 in cs.CL, cs.AI, and cs.CR

Abstract: We introduce a low-resource safety enhancement method for aligning LLMs without the need for supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Our main idea is to exploit knowledge distillation to extract the alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method significantly enhances the average defense success rate by approximately 14.41%, reaching as high as 51.39%, across 17 unaligned pre-trained LLMs, without compromising performance.
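
The abstract names delta debugging as the tool for finding the critical components. As a rough illustration only, the sketch below implements the generic ddmin minimization procedure in Python; it is not the authors' implementation. The function name `ddmin`, the integer "component" identifiers, and the `still_works` predicate are all hypothetical stand-ins. In a setting like the paper's, the predicate would presumably check whether a candidate set of components still carries the alignment behavior, but that is an assumption.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(items: Sequence[T], test: Callable[[List[T]], bool]) -> List[T]:
    """Generic delta debugging (ddmin): shrink `items` to a subset that
    still satisfies `test` and from which no single chunk can be removed."""
    current = list(items)
    if not test(current):
        raise ValueError("the full set must satisfy the test")
    n = 2  # number of chunks the current set is split into
    while len(current) >= 2:
        size = len(current)
        bounds = [size * i // n for i in range(n + 1)]
        subsets = [current[bounds[i]:bounds[i + 1]] for i in range(n)]
        reduced = False
        # Step 1: try to keep only one chunk.
        for subset in subsets:
            if test(subset):
                current, n, reduced = subset, 2, True
                break
        # Step 2: otherwise try to drop one chunk (keep its complement).
        if not reduced:
            for i in range(n):
                complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                if test(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # Step 3: otherwise split into finer chunks, or stop at single items.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


# Hypothetical example: components 2 and 5 are the only ones that matter.
def still_works(components: List[int]) -> bool:
    return {2, 5} <= set(components)


critical = ddmin(list(range(8)), still_works)
print(critical)
```

The predicate is called many times during the search, so in practice its cost dominates; ddmin only assumes that the full set passes and makes no assumption about how the components interact.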

Authors (7)
  1. Haozheng Luo (16 papers)
  2. Jiahao Yu (23 papers)
  3. Wenxin Zhang (27 papers)
  4. Jialong Li (36 papers)
  5. Jerry Yao-Chieh Hu (26 papers)
  6. Han Liu (340 papers)
  7. Xinyu Xing (34 papers)
Citations (10)