
Fake Alignment: Are LLMs Really Aligned Well? (2311.05915v3)

Published 10 Nov 2023 in cs.CL and cs.AI

Abstract: The growing awareness of safety concerns in LLMs has sparked considerable interest in the evaluation of safety. This study investigates an under-explored issue in the evaluation of LLMs, namely the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization. That is, an LLM only remembers the answer style for open-ended safety questions, which makes it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. We introduce a Fake alIgNment Evaluation (FINE) framework and two novel metrics, Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimation. Applying FINE to 14 widely-used LLMs reveals that several models with purported safety are poorly aligned in practice. We also find that multiple-choice format data can be used as high-quality contrast distillation-based fine-tuning data, which can strongly improve the alignment consistency of LLMs with minimal fine-tuning overhead. For data and code, see https://github.com/AIFlames/Fake-Alignment.
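
The abstract names the two metrics but does not give their formulas. As a rough illustration only, the sketch below assumes each safety question is judged once in open-ended form and once in multiple-choice form, that CS is the fraction of questions where the two verdicts agree, and that CSS is the fraction where both verdicts are safe. The `PairedResult` type and both function names are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedResult:
    """Hypothetical record: safety verdicts for one question in both formats."""
    open_ended_safe: bool
    multiple_choice_safe: bool


def consistency_score(results: Sequence[PairedResult]) -> float:
    """Assumed CS: share of questions where both formats give the same verdict."""
    if not results:
        raise ValueError("results must not be empty")
    agree = sum(r.open_ended_safe == r.multiple_choice_safe for r in results)
    return agree / len(results)


def consistent_safety_score(results: Sequence[PairedResult]) -> float:
    """Assumed CSS: share of questions judged safe in both formats."""
    if not results:
        raise ValueError("results must not be empty")
    both_safe = sum(r.open_ended_safe and r.multiple_choice_safe for r in results)
    return both_safe / len(results)
```

Under these assumptions, a model that answers safely in open-ended form but picks the unsafe option in multiple-choice form lowers both scores, which matches the kind of discrepancy the abstract calls fake alignment.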

Authors (10)
  1. Yixu Wang (38 papers)
  2. Yan Teng (15 papers)
  3. Kexin Huang (50 papers)
  4. Chengqi Lyu (13 papers)
  5. Songyang Zhang (116 papers)
  6. Wenwei Zhang (77 papers)
  7. Xingjun Ma (114 papers)
  8. Yu-Gang Jiang (223 papers)
  9. Yu Qiao (563 papers)
  10. Yingchun Wang (24 papers)
Citations (7)