Fake Alignment: Are LLMs Really Aligned Well? (2311.05915v3)
Abstract: The growing awareness of safety concerns in LLMs has sparked considerable interest in safety evaluation. This study investigates an under-explored issue in the evaluation of LLMs, namely the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue that this discrepancy is caused by mismatched generalization: the LLM merely memorizes the answer style for open-ended safety questions, which leaves it unable to handle other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. We introduce a Fake alIgNment Evaluation (FINE) framework and two novel metrics, the Consistency Score (CS) and the Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimates. Applying FINE to 14 widely used LLMs reveals that several models with purported safety are poorly aligned in practice. We further find that multiple-choice data can serve as high-quality contrast distillation-based fine-tuning data, which strongly improves the alignment consistency of LLMs with minimal fine-tuning overhead. For data and code, see https://github.com/AIFlames/Fake-Alignment.
- Yixu Wang (38 papers)
- Yan Teng (15 papers)
- Kexin Huang (50 papers)
- Chengqi Lyu (13 papers)
- Songyang Zhang (116 papers)
- Wenwei Zhang (77 papers)
- Xingjun Ma (114 papers)
- Yu-Gang Jiang (223 papers)
- Yu Qiao (563 papers)
- Yingchun Wang (24 papers)
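
The abstract names the CS and CSS metrics but does not give their formulas. The sketch below is one plausible reading of how a consistency check across the two question formats might be scored; the class and function names, and the exact definitions, are my assumptions rather than the paper's specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuestionResult:
    """Safety outcome for one question evaluated in both formats.

    open_ended_safe:      whether the model's free-form answer was judged safe.
    multiple_choice_safe: whether the model chose the safe option in the
                          multiple-choice version of the same question.
    """
    open_ended_safe: bool
    multiple_choice_safe: bool


def consistency_score(results: List[QuestionResult]) -> float:
    """Fraction of questions where the two formats give the same verdict
    (an assumed reading of the Consistency Score, CS)."""
    agree = sum(r.open_ended_safe == r.multiple_choice_safe for r in results)
    return agree / len(results)


def consistent_safety_score(results: List[QuestionResult]) -> float:
    """Fraction of questions where the model is safe under *both* formats
    (an assumed reading of the Consistent Safety Score, CSS)."""
    both_safe = sum(r.open_ended_safe and r.multiple_choice_safe for r in results)
    return both_safe / len(results)


if __name__ == "__main__":
    # Toy results for three questions.
    demo = [
        QuestionResult(open_ended_safe=True, multiple_choice_safe=True),
        QuestionResult(open_ended_safe=True, multiple_choice_safe=False),
        QuestionResult(open_ended_safe=False, multiple_choice_safe=False),
    ]
    print(f"CS  = {consistency_score(demo):.2f}")       # 0.67
    print(f"CSS = {consistent_safety_score(demo):.2f}")  # 0.33
```

Under this reading, a model that answers open-ended safety questions well but fails the matching multiple-choice items gets a high open-ended safety score yet a low CS and CSS, which is the "fake alignment" gap the benchmark is built to expose.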