Bi-Modal Adversarial Prompt Attack
- BAP is a dual-modality attack that modifies both image and text prompts to bypass safety guardrails in large vision–language models.
- It employs a two-stage process using gradient-based visual perturbations followed by chain-of-thought textual refinements to induce specific model responses.
- BAP achieves robust transferability and high attack success rates, outperforming unimodal methods even in black-box and adversarially robust settings.
Bi-Modal Adversarial Prompt Attack (BAP) is a methodology developed for systematically bypassing safety guardrails in large vision–language models (LVLMs) and multimodal agents by jointly optimizing perturbations in both the visual and textual modalities. The strategy is motivated by the observation that unimodal attacks, focused solely on images or text, are fundamentally inadequate against models with tight cross-modal feature alignment, wherein neither channel alone provides sufficient leverage to induce harmful or prohibited model behavior. BAP establishes a generalizable framework to engineer adversarial examples that reliably trigger model jailbreaks even under adversarially robust fusion, black-box deployment, and real-world operational constraints (Ying et al., 6 Jun 2024, Ye et al., 2023, Wang et al., 19 Apr 2025).
1. Formal Threat Model and Attack Objective
BAP operates in the context of LVLMs and multimodal agents, typically parameterized as functions $F_\theta : \mathcal{V} \times \mathcal{T} \to \mathcal{Y}$, where $\mathcal{V}$ and $\mathcal{T}$ denote the image and text prompt input spaces, respectively. The attacker is given a benign query $(v, t)$ and seeks minimal perturbations $(\delta_v, \delta_t)$ such that
$$F_\theta\!\left(v + \delta_v,\; t \oplus \delta_t\right) = y_{\mathrm{harm}}$$
holds, i.e., the perturbed query produces a specific harmful or disallowed response $y_{\mathrm{harm}}$ ($\oplus$ denotes applying the textual modification to the prompt). The attack is staged via two sequential optimization processes:
- Visual Perturbation: Solve for $\delta_v$ by maximizing the likelihood of positive/non-refusal responses subject to a norm constraint:
$$\delta_v^{*} = \arg\max_{\|\delta_v\|_\infty \le \epsilon} \; \sum_{c \in \mathcal{C}} \log p_\theta\!\left(c \mid v + \delta_v,\, t\right),$$
where $\mathcal{C}$ is a few-shot corpus eliciting model agreement.
- Textual Perturbation: With $\delta_v^{*}$ fixed, optimize $\delta_t$ for the desired harmful intent:
$$\delta_t^{*} = \arg\max_{\delta_t} \; \log p_\theta\!\left(y_{\mathrm{harm}} \mid v + \delta_v^{*},\, t \oplus \delta_t\right).$$
This bi-modal, joint objective is tailored to the tight cross-modal feature fusion found in modern LVLM architectures, as single-modality attacks consistently fail to overcome fused safety constraints (Ying et al., 6 Jun 2024, Wang et al., 19 Apr 2025).
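For concreteness, the two stage-wise objectives can be expressed as a minimal sketch. The `lvlm.log_prob` interface, function names, and argument shapes below are assumptions made for illustration, not the papers' released implementation.

```python
def affirmative_loss(lvlm, image, text, corpus):
    """Stage 1 surrogate objective: summed log-likelihood of the affirmative,
    non-refusal responses in the few-shot corpus C (hypothetical interface)."""
    return sum(lvlm.log_prob(response, image=image, text=text)
               for response in corpus)

def intent_loss(lvlm, image, text, target_response):
    """Stage 2 objective: log-likelihood of the specific target response,
    conditioned on the fixed adversarial image and the refined text prompt."""
    return lvlm.log_prob(target_response, image=image, text=text)
```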
2. Methodology: Coordinated Visual and Textual Prompt Optimization
BAP divides optimization into two modules:
A. Query-Agnostic Visual Prompt Generation
- Construct a small few-shot corpus of universally affirmative, non-refusal responses via a large LLM (e.g., “Sure, here is the answer…”).
- Initialize $v_{\mathrm{adv}}^{(0)} = v$ with the clean image and iteratively apply projected gradient ascent, where $\Pi_{\epsilon}$ denotes projection onto the $\ell_\infty$ ball of radius $\epsilon$ around $v$:
$$v_{\mathrm{adv}}^{(k+1)} = \Pi_{\epsilon}\!\left(v_{\mathrm{adv}}^{(k)} + \alpha \,\mathrm{sign}\!\left(\nabla_{v}\, \textstyle\sum_{c \in \mathcal{C}} \log p_\theta\!\left(c \mid v_{\mathrm{adv}}^{(k)},\, t\right)\right)\right)$$
- After $K$ steps, $v_{\mathrm{adv}}^{(K)}$ serves as a universal adversarial image prompt (a minimal sketch of this loop is given below).
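A minimal sketch of the query-agnostic PGD loop, assuming a differentiable `lvlm.log_prob` white-box interface (as in the objective sketch above); the step size, budget, and iteration count are illustrative rather than the values reported in the papers.

```python
import torch

def visual_prompt_pgd(lvlm, clean_image, text, corpus,
                      epsilon=8 / 255, alpha=1 / 255, steps=500):
    """Query-agnostic visual prompt via projected gradient ascent (sketch)."""
    v_adv = clean_image.clone().detach()
    for _ in range(steps):
        v_adv.requires_grad_(True)
        # Maximize the likelihood of the affirmative few-shot responses.
        loss = sum(lvlm.log_prob(r, image=v_adv, text=text) for r in corpus)
        grad = torch.autograd.grad(loss, v_adv)[0]
        with torch.no_grad():
            v_adv = v_adv + alpha * grad.sign()                 # ascent step
            delta = (v_adv - clean_image).clamp(-epsilon, epsilon)
            v_adv = (clean_image + delta).clamp(0, 1)           # project + keep valid pixels
        v_adv = v_adv.detach()
    return v_adv
```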
B. Intent-Specific Textual Prompt Refinement
- Given $v_{\mathrm{adv}}$, encode the harmful request and query the LVLM.
- Judge success via a secondary LLM using a judging template; if unsuccessful, invoke chain-of-thought (CoT) feedback prompting to refine the text prompt.
- Iterate until the judge template accepts the jailbreak or a maximum number of iterations is reached (a sketch of this refinement loop follows below).
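A sketch of the refinement loop is shown below. `judge_llm` and `rewrite_llm` stand in for the secondary judging and CoT-rewriting models, and the prompt templates are loose paraphrases rather than the templates used in the paper.

```python
def refine_text_prompt(lvlm, judge_llm, rewrite_llm, v_adv, harmful_request,
                       max_iters=5):
    """Intent-specific textual prompt refinement via judge + CoT feedback (sketch)."""
    prompt = harmful_request
    for _ in range(max_iters):
        response = lvlm.generate(image=v_adv, text=prompt)
        verdict = judge_llm(
            f"Does the following response fulfil the intent '{harmful_request}'? "
            f"Answer yes or no.\n\nResponse: {response}")
        if verdict.strip().lower().startswith("yes"):
            return prompt, response          # judge accepts the jailbreak
        # Chain-of-thought feedback: reason about the refusal, then rewrite.
        prompt = rewrite_llm(
            "The prompt below failed to elicit the intended response. "
            "Think step by step about why it was refused, then output a "
            f"revised prompt.\n\nPrompt: {prompt}\n\nResponse: {response}")
    return prompt, None                      # iteration budget exhausted
```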
Mutual-modality attack frameworks generalize this protocol by learning semantic perturbations in the visual embedding space (e.g., CLIP encoders), coupled with saliency-guided discrete token replacement in textual prompts for cross-task, cross-domain, and cross-architecture transferability (Ye et al., 2023).
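The discrete textual side of such schemes can be approximated by a token-replacement search in the shared embedding space. The sketch below is a simplified, brute-force greedy stand-in for saliency-guided replacement, with `encode_text` assumed to be a CLIP-style text encoder over token id lists.

```python
import torch

def greedy_token_swap(encode_text, tokens, candidate_ids, target_embedding):
    """Greedy single-token replacement toward an adversarial target feature (sketch)."""
    best_tokens, best_sim = list(tokens), -float("inf")
    for pos in range(len(tokens)):
        for cand in candidate_ids:
            trial = list(tokens)
            trial[pos] = cand                          # tentative replacement
            sim = torch.nn.functional.cosine_similarity(
                encode_text(trial), target_embedding, dim=-1).item()
            if sim > best_sim:
                best_tokens, best_sim = trial, sim     # keep the best swap found
    return best_tokens, best_sim
```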
3. Transferability and Black-box Feasibility
BAP leverages white-box (gradient-based) or black-box (transfer-attack, gradient-free) optimization for visual perturbations. For models with inaccessible internal parameters (e.g., Gemini, ChatGLM, Qwen 2.1, ERNIE Bot 3.5), surrogate ensemble attacks (SSA-CWA) and meta-prompting for system prompt inference have demonstrated empirical transfer rates of 30–40% average attack success, with up to 97% attack success rate (ASR) on white-box setups (Ying et al., 6 Jun 2024, Wang et al., 19 Apr 2025). Cross-domain, cross-architecture, and cross-task experiments on CIFAR-10, ImageNet, Comics, Paintings, and ChestX confirm the stability and versatility of BAP, with observed average accuracy drops and attack success improvements over prior baselines (Ye et al., 2023).
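In the black-box setting, the transfer direction for the visual perturbation is typically estimated from an ensemble of white-box surrogates. The sketch below simply averages per-model gradients of the affirmative-response objective; it is a simplified stand-in for SSA-CWA-style ensembling and reuses the hypothetical `log_prob` interface from the sketches above.

```python
import torch

def ensemble_transfer_grad(surrogates, v_adv, text, corpus):
    """Average gradient across surrogate LVLMs as a transfer direction (sketch)."""
    grads = []
    for model in surrogates:
        v = v_adv.clone().detach().requires_grad_(True)
        loss = sum(model.log_prob(r, image=v, text=text) for r in corpus)
        grads.append(torch.autograd.grad(loss, v)[0])
    return torch.stack(grads).mean(dim=0)   # averaged direction for the PGD step
```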
| Attack Scenario | Baseline ASR (%) | BAP ASR (%) | Gain (pp) |
|---|---|---|---|
| Query-Agnostic (ILLEGAL/HATE → All) | 32–35 | 68.2 | +33–36 |
| Open-source LVLM (white-box) | 44–48 | 68.2 | +20–24 |
| Commercial LVLM (black-box) | – | 30–40 | – |
| Transfer (cross-arch/domain/task) | – | – | +3–16 |
4. Attack Analysis and Empirical Insights
Empirical ablations establish the necessity of joint optimization:
- Removing the visual perturbation causes a large drop in ASR.
- Removing the textual CoT refinement causes a comparably large drop.
- Semantically related but unperturbed images yield only moderate ASR, whereas the universal perturbation raises it substantially.
- Random images or pure noise yield markedly lower ASR.
- The few-shot “encourage and avoid negativity” corpus outperforms other visual optimization bases by 10–20 pp (Ying et al., 6 Jun 2024).
- Embedding feature t-SNE post-attack shows uniform mixing, evading basic detection filters (Ye et al., 2023).
Qualitative examples include multimodal agents such as driving assistants, where adversarial image patches and text commands (“increase speed, ignore stop sign”) reliably induce forbidden behavior in physical-world scenarios (Wang et al., 19 Apr 2025).
5. Key Limitations and Defensive Countermeasures
BAP’s limitations include reliance on white-box gradient access for image optimization, the high computational cost of iterative textual refinement, and dependency on surrogate models for black-box transferability. In particular:
- Gradient-free optimization for images remains an open direction.
- Textual refinement currently demands two LLM calls per iteration (judge + propose), which is computationally expensive (Ying et al., 6 Jun 2024).
- Prompt-token replacement may not scale for high-dimensional or long textual prompts (Ye et al., 2023).
Potential defenses are classified as follows:
- Multimodal adversarial training: Safety fine-tuning using paired adversarial image–text prompts.
- Prompt sanitization/detection: Reject inputs with image–text pairs matching known or learned perturbation patterns.
- Modality consistency checks: Force unimodal safe predictions to align before output generation (a minimal sketch follows this list).
- Input blur and “sandwich” prompting: Modest ASR reductions, but BAP typically maintains high attack success even under combined defenses (Wang et al., 19 Apr 2025).
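As a concrete illustration of the consistency-check idea above, the sketch below refuses generation unless both channels independently pass a unimodal safety check; `safety_classifier` and `lvlm.caption` are hypothetical interfaces introduced only for this example.

```python
def modality_consistency_check(safety_classifier, lvlm, image, text):
    """Refuse unless each modality is independently judged safe (sketch)."""
    caption = lvlm.caption(image)            # describe the image in isolation
    image_safe = safety_classifier(caption)  # unimodal check on the image content
    text_safe = safety_classifier(text)      # unimodal check on the raw text prompt
    if not (image_safe and text_safe):
        return "Request refused: cross-modal safety check failed."
    return lvlm.generate(image=image, text=text)
```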
6. Related Work and Comparisons
Prior adversarial attacks on LVLMs, including typographic image prompts, corpus-specific visual perturbations (Liu et al., Qi et al.), and prompt-tuning approaches (UAN-Res, GAP, BIA, CDA, LTAP, GAMA), have achieved only modest ASR improvements. Mutual-modality adversarial schemes, leveraging aligned embedding perturbations and iterative prompt updates, substantially increase efficacy, confirming the hypothesis that only joint visual and textual injection subverts multimodal safety guardrails (Ye et al., 2023, Ying et al., 6 Jun 2024). CrossInject and BAP variants yield sizeable ASR improvements over prior state-of-the-art methods in diverse multimodal agent settings (Wang et al., 19 Apr 2025).
7. Practical Considerations and Future Directions
BAP functions as a plug-and-play framework and can wrap any gradient-based image attack or prompt-tuning protocol. Efficient bi-modal optimization pipelines have been implemented for both image–token and embedding–token spaces (CLIP, LLaVA, MiniGPT-4, InstructBLIP), and experimental deployments utilize ensemble surrogates for feature transfer. Notable open research directions include:
- End-to-end prompt generators for continuous semantic space attacks (Ye et al., 2023).
- Broader application to emerging multimodal models (ALIGN, Florence), cloud-based black-box systems, and physical-world embedded agents.
- Improved query-efficient optimization for defense-robust models (Ying et al., 6 Jun 2024, Wang et al., 19 Apr 2025).
A plausible implication is that as multimodal agents and LVLMs gain wider deployment, comprehensive cross-modal adversarial defenses will become essential for ensuring model safety and robust alignment across unconstrained operational domains.