"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak (2406.11668v1)

Published 17 Jun 2024 in cs.CL

Abstract: "Jailbreak" is a major safety concern for LLMs: it occurs when malicious prompts lead LLMs to produce harmful outputs, raising questions about the reliability and safety of these models. Effective evaluation of jailbreaks is therefore crucial for developing mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations: erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the $\textbf{B}$enchmark for reli$\textbf{AB}$ilit$\textbf{Y}$ and jail$\textbf{B}$reak ha$\textbf{L}$l$\textbf{U}$cination $\textbf{E}$valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework, including various evaluators, to enhance existing jailbreak benchmarks and ensure that flagged outputs are genuinely useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks and aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.
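
As a rough illustration of the validation idea described in the abstract, the sketch below shows a two-stage check in which a candidate "jailbreak" response only counts as a genuine breach if it both appears to comply with a harmful request and is coherent enough to be actionable rather than hallucinated. This is a minimal sketch under assumed semantics, not the paper's actual BabyBLUE evaluators; all function names and heuristics are hypothetical placeholders.

```python
# Hypothetical sketch: separating genuine jailbreaks from hallucinated ones.
# A real framework would use model-based evaluators; the heuristics here are toys.

from dataclasses import dataclass


@dataclass
class Verdict:
    harmful_intent: bool  # does the output attempt to comply with the malicious prompt?
    actionable: bool      # is the content coherent and usable, or a hallucination?

    @property
    def genuine_jailbreak(self) -> bool:
        # Only outputs that are both harmful in intent and non-hallucinated
        # are counted as real safety breaches.
        return self.harmful_intent and self.actionable


def judge_harmful_intent(response: str) -> bool:
    """Placeholder for an evaluator that flags apparent compliance with a harmful request."""
    refusal_markers = ("i can't", "i cannot", "i'm sorry", "as an ai")
    return not any(marker in response.lower() for marker in refusal_markers)


def judge_actionable(response: str) -> bool:
    """Placeholder for a hallucination check: fabricated, incoherent, or vacuous
    'instructions' should not be scored as genuine breaches."""
    return len(response.split()) > 30  # toy proxy; a real check would be model-based


def evaluate(response: str) -> Verdict:
    return Verdict(judge_harmful_intent(response), judge_actionable(response))


if __name__ == "__main__":
    fabricated = "Sure! Step 1: obtain unobtainium crystals from the moon's core..."
    print(evaluate(fabricated).genuine_jailbreak)  # False: complies, but not actionable
```

In this framing, a naive success metric (e.g. "the model did not refuse") would count the fabricated response above as a jailbreak, whereas the second-stage actionability check filters it out, which is the kind of over-counting the paper attributes to hallucinations.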

Authors (6)
  1. Lingrui Mei (20 papers)
  2. Shenghua Liu (33 papers)
  3. Yiwei Wang (120 papers)
  4. Baolong Bi (23 papers)
  5. Jiayi Mao (6 papers)
  6. Xueqi Cheng (274 papers)
Citations (6)
