
Black-Box Jailbreak Attacks in LLMs

Updated 16 May 2026
  • Black-box jailbreak attacks are adversarial strategies that manipulate LLM outputs through crafted input queries without accessing internal model details.
  • Techniques like zeroth-order optimization and iterative prompt engineering achieve high success rates in bypassing model safeguards.
  • These methods underscore significant security and alignment implications, necessitating robust countermeasures in AI deployment.

Black-box jailbreak attacks are adversarial strategies that induce LLMs and multi-modal LLMs (MLLMs) to generate harmful or policy-violating outputs solely through input-output queries, without any access to internal model parameters, gradients, or architecture details. These techniques are a central focus in alignment, safety, and AI security research, as state-of-the-art models increasingly rely on black-box APIs and deploy robust internal guardrails and content filters. Black-box attacks encompass text-only, multimodal, obfuscation-based, sequential, RL-optimized, and even game-theoretic approaches, demonstrating high attack success rates (ASR) against both open-source and commercial systems. This entry overviews the taxonomy, mathematical formalisms, optimization strategies, empirical evidence, and security implications of black-box jailbreak attacks, grounded in the most recent literature.

1. Formal Definition and Taxonomy

In the black-box setting, the adversary interacts with a target model $M_\theta$ exclusively via API queries, issuing crafted inputs $x$ (text, image, or both) and observing outputs $y = M_\theta(x)$. The goal is to find $x_{\mathrm{adv}}$ such that

$$\mathbb{E}_{x \sim x_{\mathrm{adv}}}\bigl[S_{\mathrm{harm}}(M_\theta(x))\bigr] \;\text{is maximized}$$

subject to

$$S_{\mathrm{tox}}(x) < \epsilon \quad (\text{e.g., no toxic keywords or forbidden patterns})$$

while respecting a strict query budget $Q$ and possessing no access to model internals (Liu et al., 2024). The survey (Liu et al., 2024) classifies black-box jailbreak attacks by where in the model stack they operate:

  • Input-Level: Prompt engineering, template injection, multimodal perturbations
  • Output-Level: Iterative refinement (e.g., RL, MCTS), zeroth-order estimation, multi-turn dialogue
  • Encoder/Generator-Level: Only available under gray/white-box conditions

Prominent modalities include text-only (LLMs), image+text (MLLMs/LVLMs), and hybrid (e.g., audio, code).
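
The attack success rate (ASR) reported throughout this entry, combined with the query budget $Q$ from the formal definition, can be computed from logged evaluation results. The following is a minimal sketch under stated assumptions: the `AttemptRecord` schema and the `attack_success_rate` function are hypothetical names introduced for illustration, and the rule that an attempt counts only if it stayed within the budget is one plausible convention, not one taken from the cited papers.

```python
from dataclasses import dataclass


@dataclass
class AttemptRecord:
    """One logged evaluation attempt (hypothetical schema, for illustration only)."""
    judged_harmful: bool  # verdict of whichever judge the evaluation used
    queries_used: int     # number of API queries the attempt consumed


def attack_success_rate(records, query_budget):
    """Fraction of attempts judged successful within the query budget Q."""
    if not records:
        raise ValueError("at least one record is required")
    successes = sum(
        1
        for r in records
        if r.judged_harmful and r.queries_used <= query_budget
    )
    return successes / len(records)


# Example log: the second attempt succeeded but exceeded the budget of 20 queries.
example_log = [
    AttemptRecord(judged_harmful=True, queries_used=5),
    AttemptRecord(judged_harmful=True, queries_used=50),
    AttemptRecord(judged_harmful=False, queries_used=3),
    AttemptRecord(judged_harmful=True, queries_used=10),
]
example_asr = attack_success_rate(example_log, query_budget=20)
```

Reported ASR figures depend on the judge and on the budget, so numbers from different papers are comparable only when both match.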

2. Optimization Strategies and Core Methodologies

A representative range of black-box attacks includes:

2.1 Zeroth-Order (Gradient-Free) Optimization

The Zer0-Jack algorithm (Chen et al., 2024) performs black-box jailbreaking of MLLMs by maximizing a jailbreak logit $f(x)$ via finite-difference estimators. Since gradients are inaccessible, the attack employs:

$$\hat{\nabla} f(x) = \frac{f(x+\sigma u) - f(x-\sigma u)}{2\sigma}\, u$$

where $u \in S^{d-1}$ is a random unit vector. Patch coordinate descent restricts updates to image patches, yielding an efficient projected optimization.

Zer0-Jack achieves 95% ASR on MiniGPT-4, exceeding transfer-based methods and rivaling white-box attacks (Chen et al., 2024).
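
The two-point finite-difference estimator above is a generic zeroth-order tool, and its behavior is easiest to see on a toy function. The sketch below is not the Zer0-Jack implementation: `toy_objective` is a hypothetical smooth stand-in for $f$ with a known gradient, all function names are illustrative, and the averaging step with a factor of $d$ is an added assumption (it follows from $\mathbb{E}[u u^\top] = I/d$ for uniform unit vectors) used only to check the estimate against the true gradient.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample u uniformly from the unit sphere S^{d-1} by normalizing a Gaussian draw."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def two_point_estimate(f, x, sigma, rng):
    """Single-sample estimate: (f(x + sigma*u) - f(x - sigma*u)) / (2*sigma) * u."""
    u = random_unit_vector(len(x), rng)
    x_plus = [xi + sigma * ui for xi, ui in zip(x, u)]
    x_minus = [xi - sigma * ui for xi, ui in zip(x, u)]
    slope = (f(x_plus) - f(x_minus)) / (2.0 * sigma)
    return [slope * ui for ui in u]


def averaged_estimate(f, x, sigma, num_samples, rng):
    """Average many single-sample estimates and rescale by d to approximate the gradient."""
    dim = len(x)
    total = [0.0] * dim
    for _ in range(num_samples):
        g = two_point_estimate(f, x, sigma, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [dim * t / num_samples for t in total]


def toy_objective(x):
    """Hypothetical stand-in for f: concave quadratic maximized at the all-ones point."""
    return -sum((xi - 1.0) ** 2 for xi in x)


rng = random.Random(0)
x0 = [0.0, 0.0, 0.0]
# The true gradient of toy_objective at x0 is [2.0, 2.0, 2.0].
grad_estimate = averaged_estimate(toy_objective, x0, sigma=0.01, num_samples=20000, rng=rng)
```

Each single-sample estimate costs two function evaluations, which is why query budgets dominate the cost of zeroth-order methods in high dimensions.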

2.2 Prior-Guided Bimodal Optimization

PBI-Attack (Cheng et al., 2024) employs bimodal optimization: adversarial features extracted from a harmful corpus are embedded as priors into benign images, then iteratively enhanced via greedy, alternating updates to image and text. Toxicity is scored via ensembles such as Perspective API. PBI-Attack attains a mean ASR of 92.5% on open-source LVLMs and 67% on closed-source LVLMs, outperforming prior SOTA methods.

2.3 Iterative Black-Box Prompt Engineering

Simple black-box prompts (Takemoto, 2024) exploit the model’s own paraphrasing abilities: by repeatedly querying the LLM to “gently” rewrite forbidden questions, the method discovers concise, naturally worded jailbreak prompts, yielding over 80% ASR on GPT-3.5/4 and Gemini-Pro with median convergence in about 5 queries.

Tree of Attacks with Pruning (TAP) [2312
