Breach By A Thousand Leaks: Unsafe Information Leakage in `Safe' AI Responses (2407.02551v2)

Published 2 Jul 2024 in cs.CR, cs.AI, and cs.CY

Abstract: The vulnerability of frontier LLMs to misuse and jailbreaks has prompted the development of safety measures like filters and alignment training, in an effort to ensure safety through robustness to adversarially crafted prompts. We assert that robustness is fundamentally insufficient for ensuring safety goals, and that current defenses and evaluation methods fail to account for the risks of dual-intent queries and their composition toward malicious goals. To quantify these risks, we introduce a new safety evaluation framework based on impermissible information leakage of model outputs and demonstrate how our proposed question-decomposition attack can extract dangerous knowledge from a censored LLM more effectively than traditional jailbreaking. Underlying our proposed evaluation method is a novel information-theoretic threat model of inferential adversaries, distinguished from security adversaries such as jailbreakers in that success is measured by inferring impermissible knowledge from victim outputs rather than forcing explicitly impermissible outputs from the victim. Through our information-theoretic framework, we show that ensuring safety against inferential adversaries requires defense mechanisms that enforce information censorship, bounding the leakage of impermissible information. However, we prove that such defenses inevitably incur a safety-utility trade-off.
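
To make the inferential-adversary idea concrete, below is a minimal sketch of a question-decomposition attack, assuming a generic chat-completion interface; `query_model`, `decompose`, and `aggregate` are hypothetical names for illustration, not the paper's code. The adversary splits an impermissible query into sub-questions that each look benign in isolation, queries the filtered model on each, and recombines the answers offline, so no single model output is explicitly impermissible.

```python
# Minimal illustrative sketch (not the authors' implementation) of a
# question-decomposition attack against a safety-filtered LLM.
# `query_model` is a hypothetical callable standing in for any chat API.

from typing import Callable, List


def decompose(impermissible_query: str) -> List[str]:
    """Split a harmful query into dual-intent sub-questions that each look
    benign in isolation (hand-written templates here, purely for illustration).
    """
    return [
        f"What general background knowledge is relevant to: {impermissible_query}?",
        "What materials or equipment are commonly discussed in that context?",
        "What precautions do professionals take in that setting, and why?",
    ]


def aggregate(sub_answers: List[str]) -> str:
    """Recombine the benign-looking answers offline; the adversary's inference
    step happens outside the model, so no output filter sees the composed result."""
    return "\n".join(sub_answers)


def decomposition_attack(impermissible_query: str,
                         query_model: Callable[[str], str]) -> str:
    """Each sub-question leaks a little impermissible information; success is
    measured by what the adversary can infer, not by any single output."""
    sub_questions = decompose(impermissible_query)
    sub_answers = [query_model(q) for q in sub_questions]
    return aggregate(sub_answers)
```

Because each individual response can pass output-level robustness checks, the abstract argues that defenses must instead bound the impermissible information leaked across outputs (information censorship), which is where the proven safety-utility trade-off arises.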

Authors (5)
  1. David Glukhov (4 papers)
  2. Ziwen Han (9 papers)
  3. Ilia Shumailov (72 papers)
  4. Vardan Papyan (26 papers)
  5. Nicolas Papernot (123 papers)
Citations (1)