Bi-Modal Adversarial Prompting

Updated 11 March 2026
  • Bi-Modal Adversarial Prompting is a multi-modal technique that integrates visual and textual prompt interventions to enhance robustness, alignment, and safety in vision-language models.
  • It deploys diverse methods such as adversarial prompt distillation, Fourier-based prompting, and LLM-driven text enhancements to optimize both defensive and offensive strategies.
  • BAP methods demonstrate significant improvements in robustness across benchmarks while also exposing vulnerabilities in model safety and cross-modal resilience.

Bi-Modal Adversarial Prompting (BAP) constitutes a category of multi-modal prompt engineering and adversarial manipulation strategies designed to shape, regularize, or subvert the behavior of neural models, particularly vision-language models (VLMs) and large vision-language models (LVLMs), through coordinated interventions in both visual and textual modalities. BAP methods span defenses aimed at adversarial robustness, alignment for multi-modal learning under cross-modal corruptions, and offensive strategies (jailbreak attacks) to defeat safety guardrails in aligned models. Core to BAP is the design or optimization of complementary prompts across multiple modalities, leveraging their joint processing within contemporary multi-modal architectures.

1. Core Principles and Taxonomy

Bi-Modal Adversarial Prompting, as instantiated in leading works, encompasses a spectrum of technical approaches, unified by the simultaneous manipulation or learning of both visual and textual prompts (or their derived representations). The scope includes adversarial-robustness defenses, alignment under cross-modal corruptions, and jailbreak attacks on safety-aligned models.

Key distinctions arise in the mechanism of prompt generation (gradient-based, LLM-driven, rule-based), the role of adversarial objectives (explicit loss maximization, data augmentation), and whether BAP serves robustness, safety evaluation, or alignment.

2. Bi-Modal Prompt Construction and Insertion

BAP techniques rely on parameterizing and injecting prompts into both the visual and textual branches of multi-modal models:

  • Model Anatomy: In VLMs such as CLIP, distinct frozen image ($f_v$) and text ($f_t$) encoders process respective inputs. BAP methods introduce learnable visual prompts $P_v$ (e.g., prepended to patch embeddings or inserted at various transformer depths) and textual prompts $P_t$ (prepended to embedding sequences corresponding to class labels or natural language templates) (Luo et al., 2024).
  • Fourier-based Prompting: In PAP (Xu et al., 6 Feb 2025), prompts are constructed in frequency space. For each class $c$, additive phase-level $P_c^{\Phi}$ and amplitude-level $P_c^{\mathcal{A}}$ prompts are learned and injected via Discrete Fourier Transform manipulations.
  • Prompt Placement and Scale: Placement depth and prompt length impact both robust and clean performance. Deeper insertion and ~16-token prompt lengths maximize robustness in APD (Luo et al., 2024); excessive prompt length underfits in low-data regimes.
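
As a rough illustration of the prompt insertion described above, the following sketch prepends learnable prompt tokens to a token sequence in each modality. The helper name `prepend_prompt`, the 4-dimensional embeddings, and the toy values are assumptions made for illustration; real methods operate on encoder embeddings and may also insert prompts at deeper layers.

```python
from typing import List

Vector = List[float]

def prepend_prompt(prompt_tokens: List[Vector], input_tokens: List[Vector]) -> List[Vector]:
    """Return a new sequence with the prompt tokens placed before the input tokens."""
    dim = len(input_tokens[0])
    if any(len(tok) != dim for tok in prompt_tokens + input_tokens):
        raise ValueError("all tokens must share the same embedding dimension")
    return [list(tok) for tok in prompt_tokens] + [list(tok) for tok in input_tokens]

# Toy sizes (assumptions): 4-dim embeddings, 2 prompt tokens per modality.
P_v = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]  # learnable visual prompt
P_t = [[0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.0, 0.1]]  # learnable textual prompt
patch_embeddings = [[1.0, 2.0, 3.0, 4.0]] * 3       # stand-in for image patch embeddings
label_embeddings = [[0.5, 0.5, 0.5, 0.5]] * 2       # stand-in for class-name token embeddings

visual_sequence = prepend_prompt(P_v, patch_embeddings)  # would be fed to the image encoder
text_sequence = prepend_prompt(P_t, label_embeddings)    # would be fed to the text encoder
print(len(visual_sequence), len(text_sequence))          # 5 4
```

Only the prompt tokens would be trained in such a setup; the encoders themselves stay frozen, as the text notes for CLIP-style models.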

A summary of prompt construction mechanisms in paradigmatic BAP methods:

| Method | Visual Prompt | Textual Prompt | Spectrum Prompting |
|---|---|---|---|
| APD (Luo et al., 2024) | Learnable $P_v$ | Learnable $P_t$ | - |
| BAP-PAP (Xu et al., 6 Feb 2025) | $P_c^{\Phi}$, $P_c^{\mathcal{A}}$ in frequency domain | - | Phase + Amplitude |
| Jailbreak BAP (Ying et al., 2024) | Adversarial $\delta^*$ | Iteratively refined via LLM | - |
| Text-centric (Tsai et al., 2024) | LLM-generated captions | LLM-enhanced summaries/CoTs | - |
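
To make the frequency-domain row concrete, here is a minimal sketch of additive amplitude and phase prompting on a 1-D signal, using a naive DFT from the standard library. The function `apply_spectral_prompt`, the 1-D setting, and the omission of per-class prompts are simplifying assumptions; this is not the PAP implementation.

```python
import cmath
import math
from typing import List

def dft(x: List[float]) -> List[complex]:
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X: List[complex]) -> List[complex]:
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def apply_spectral_prompt(x: List[float], amp_prompt: List[float],
                          phase_prompt: List[float]) -> List[float]:
    """Add amplitude and phase prompts to the spectrum of x, then invert the DFT."""
    prompted = []
    for X_k, a_k, p_k in zip(dft(x), amp_prompt, phase_prompt):
        amplitude = max(abs(X_k) + a_k, 0.0)  # amplitudes cannot be negative
        phase = cmath.phase(X_k) + p_k
        prompted.append(cmath.rect(amplitude, phase))
    # Keeping only the real part is exact when the prompts preserve conjugate symmetry.
    return [v.real for v in idft(prompted)]

signal = [1.0, 2.0, 3.0, 4.0]
zero = [0.0] * len(signal)
identity = apply_spectral_prompt(signal, zero, zero)           # reproduces the signal
brighter = apply_spectral_prompt(signal, [4.0, 0.0, 0.0, 0.0], zero)  # raises the mean by 1
```

A zero prompt leaves the input unchanged, which gives a convenient sanity check before any prompt values are learned.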

3. Adversarial Training, Optimization, and Distillation

Advanced BAP defenses adopt a bi-level adversarial optimization:

  • Inner Maximization: Generates adversarial perturbations, typically in the image domain, that maximize the loss with respect to the model’s current prompts (e.g., PGD-based maximization $\delta^* = \arg\max_{\|\delta_x\|_\infty \leq \epsilon} \mathcal{L}_{\mathrm{CE}}(\cdot)$) (Luo et al., 2024, Xu et al., 6 Feb 2025). In APD, adversarial examples are generated only for the student branch; the teacher receives clean inputs.
  • Outer Minimization: Updates prompt parameters to minimize a combination of classification and distillation losses.
    • APD employs an online teacher-student framework: the teacher is trained with cross-entropy on natural data, while a KL divergence term aligns the student’s outputs on adversarial inputs with the teacher’s clean outputs, weighted by a balance term $\beta$ ($\beta \sim 0.2$ is optimal; $\beta = 0$ disables distillation).
    • Student supervision is via $\mathcal{L}_{\mathrm{KL}}(S(x+\delta^*)\,\|\,T(x))$ (Luo et al., 2024).
    • PAP (Xu et al., 6 Feb 2025) minimizes a weighted sum of adversarial, natural, reconstruction-similarity, and data–prompt mismatch losses.
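
The student-side objective above can be sketched in a few lines. The additive form `ce + beta * kl` is an assumption chosen so that `beta = 0` switches distillation off, as the text states; the exact weighting used in APD is not given here, and the function names are hypothetical.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits: List[float], label: int) -> float:
    return -math.log(softmax(logits)[label])

def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def student_loss(student_adv_logits: List[float], teacher_clean_logits: List[float],
                 label: int, beta: float = 0.2) -> float:
    """Classification loss plus beta-weighted distillation (assumed additive form)."""
    s = softmax(student_adv_logits)    # S(x + delta*), student on the adversarial input
    t = softmax(teacher_clean_logits)  # T(x), teacher on the clean input
    return cross_entropy(student_adv_logits, label) + beta * kl_divergence(s, t)

loss_with_distillation = student_loss([2.0, 0.0], [0.0, 2.0], label=0, beta=0.2)
loss_without = student_loss([2.0, 0.0], [0.0, 2.0], label=0, beta=0.0)
```

The argument order of the KL term follows the expression in the text, with the student distribution first.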

Adversarial prompting in text-centric settings (Tsai et al., 2024) and offensive settings (Ying et al., 2024) typically avoids differentiable optimization for text. Instead, candidate prompts are iteratively refined using LLM-driven paraphrasing, chain-of-thought reasoning, and explicit corruption to produce robustifying or attack-inducing text augmentations.
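
Because the text side is refined by search rather than by gradients, the control flow reduces to a propose-and-score loop. The sketch below shows that loop with deterministic toy stand-ins: `toy_rewrite` plays the role of an LLM paraphraser and `toy_score` the role of an LLM judge. Both are hypothetical placeholders, not components of any cited method.

```python
from typing import Callable, Tuple

def refine(seed_text: str, rewrite: Callable[[str], str],
           score: Callable[[str], float], n_iters: int = 5) -> Tuple[str, float]:
    """Keep the best-scoring candidate seen over n_iters rewrite proposals."""
    best, best_score = seed_text, score(seed_text)
    for _ in range(n_iters):
        candidate = rewrite(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # accept only strict improvements
            best, best_score = candidate, candidate_score
    return best, best_score

FILLER = {"very", "really", "quite"}

def toy_rewrite(text: str) -> str:
    """Toy paraphraser (assumption): drop the first filler word, if any."""
    words = text.split()
    for i, word in enumerate(words):
        if word in FILLER:
            return " ".join(words[:i] + words[i + 1:])
    return text

def toy_score(text: str) -> float:
    """Toy judge (assumption): shorter text scores higher."""
    return -float(len(text.split()))

result = refine("a very really small dog", toy_rewrite, toy_score)
print(result)  # ('a small dog', -3.0)
```

In practice both callables would wrap model queries, so the number of iterations directly sets the query cost.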

4. Algorithms and Procedural Workflows

A distinctive feature of BAP frameworks is the tight integration of multi-modal prompt optimization within an adversarial (or jailbreak) optimization loop. The workflows of the four representative methods are summarized below.

APD (Luo et al., 2024):

  1. Generate adversarial perturbations:
    • $\delta \leftarrow 0$
    • For $k = 1, \dots, K$: update $\delta$ with a projected-gradient step that maximizes cross-entropy, subject to $\|\delta\|_\infty \leq \epsilon$.
  2. Forward clean/adversarial data through teacher and student; collect logits.
  3. Compute teacher (clean/distance) and student (distillation) losses.
  4. Update $P^{(T)}$ and $P^{(S)}$ in parallel by gradient descent, according to the loss weights.

Jailbreak BAP (Ying et al., 2024):

  1. Optimize image: Universal $\delta^*$ via multi-query PGD against a positive safety-override corpus.
  2. Refine text: For $N$ iterations, use LLMs to judge success and generate improved prompts via chain-of-thought feedback.

PAP (Xu et al., 6 Feb 2025):

  • For each batch: construct phase/amplitude-prompted images, apply attack methods, compute ensemble loss, and update prompts. Amplitude weighting $w_t$ is updated periodically based on comparative robust accuracy.

Text-centric (Tsai et al., 2024):

  • Produce LLM-generated text variants for each modality, concatenate augmented prompts, and train a downstream transformer to minimize alignment loss over both clean and adversarial variants.
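
The perturbation-generation step (zero initialization followed by K projected-gradient steps under an L-infinity budget) can be sketched on a toy linear softmax classifier. The linear model, the step size `alpha`, and all function names are assumptions for illustration; the cited methods attack prompt-conditioned VLMs and typically also clip to the valid pixel range.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_fn(W: List[List[float]], b: List[float], x: List[float]) -> List[float]:
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def ce_loss(W, b, x, y: int) -> float:
    return -math.log(softmax(logits_fn(W, b, x))[y])

def ce_grad_x(W, b, x, y: int) -> List[float]:
    """Gradient of cross-entropy w.r.t. the input: W^T (softmax - onehot)."""
    p = softmax(logits_fn(W, b, x))
    err = [pi - (1.0 if i == y else 0.0) for i, pi in enumerate(p)]
    return [sum(err[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]

def sign(v: float) -> int:
    return (v > 0) - (v < 0)

def pgd_linf(W, b, x, y: int, eps: float, alpha: float, steps: int) -> List[float]:
    delta = [0.0] * len(x)                       # delta <- 0
    for _ in range(steps):                       # k = 1..K
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = ce_grad_x(W, b, x_adv, y)
        # Ascent step on the loss, then projection onto the L-infinity ball of radius eps.
        delta = [min(max(di + alpha * sign(gi), -eps), eps)
                 for di, gi in zip(delta, grad)]
    return delta

W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x, y = [1.0, 0.5], 0
delta = pgd_linf(W, b, x, y, eps=0.3, alpha=0.1, steps=3)
x_adv = [xi + di for xi, di in zip(x, delta)]
```

With these toy values the perturbation saturates the budget in both coordinates and the cross-entropy at `x_adv` exceeds the clean loss.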

5. Empirical Performance and Robustness Benchmarks

BAP methods demonstrate substantive improvements across robustness metrics in multi-modal and single-modal benchmarks:

  • APD (Bi-Modal APT Distillation) (Luo et al., 2024): On 8 few-shot data sets (e.g., ImageNet, Caltech101), average clean/robust accuracy under PGD-100: APD 75.71%/47.50% (sum 123.21), outperforming textual, visual, and bimodal baselines such as APT-T, APT-V, APT-VL, and FAP-VL by 5–44 percentage points in robustness. AutoAttack: APD 42.88% versus FAP-VL 41.29%. Bimodal APD yields superior results to unimodal: e.g., APD-V (43.78% robust), APD-T (0.45% robust) (Luo et al., 2024).
  • Text-Centric Adversarial Prompting (Tsai et al., 2024): On real-world, noisy, dynamic, and missing modality settings (PetFinder ACC, Airbnb MSE, Avito RMSE), Text-Adversarial prompting minimizes effective robustness drop (e.g., 90–99%) compared to alternatives such as Kosmos-2, Flamingo, PGD, and standard dropout.
  • Bi-Modal Adversarial Prompt Attack (Ying et al., 2024): Average attack success rate (ASR) lifts of +29% over best baselines (MiniGPT-4: 68.17% vs 47.76% (Liu et al)), strong transferability in black-box settings (ASR ~52%), and ≥32% ASR on commercial models (e.g., Gemini 41.2%, Qwen 44.0%).
  • PAP (Xu et al., 6 Feb 2025): Robust accuracy on CIFAR-10 under AutoAttack: from 0.0% (no prompting) to 37.3% (PAP). On adversarially trained backbones, still yields +7–8% robust accuracy.

| Method | Clean (%) | Robust (%) | Robustness Gain | Dataset |
|---|---|---|---|---|
| APD | 75.71 | 47.50 | +3–5 pp over FAP-VL | 8-dataset average (ViT-B/16) (Luo et al., 2024) |
| PAP | 87.1 | 37.3 | +37.3 pp (AutoAttack) | CIFAR-10, ResNet-18 (Xu et al., 6 Feb 2025) |
| Jailbreak BAP | - | 68.17 (ASR, MiniGPT-4) | +29.03 pp over best baselines | SafetyBench + AdvBench (Ying et al., 2024) |
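
For readers unfamiliar with the metrics, the short sketch below shows how clean accuracy, robust accuracy, and their sum are computed from predictions. The prediction lists are made-up toy data; only the final line uses figures reported above for APD.

```python
from typing import List

def accuracy(preds: List[int], labels: List[int]) -> float:
    """Percentage of predictions that match the labels."""
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)

labels = [0, 1, 1, 0]
clean_preds = [0, 1, 1, 1]  # toy predictions on clean inputs
adv_preds = [0, 1, 0, 1]    # toy predictions on adversarially perturbed inputs

clean_acc = accuracy(clean_preds, labels)  # 75.0
robust_acc = accuracy(adv_preds, labels)   # 50.0
combined = clean_acc + robust_acc          # 125.0

# The reported APD sum follows the same arithmetic.
apd_sum = round(75.71 + 47.50, 2)          # 123.21
```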

6. Insights, Ablation Diagnoses, and Practical Considerations

  • Teacher-student online distillation: Non-robust (clean) teacher models provide soft semantic labels that, when distilled into the student under adversarial training, enhance both robustness and clean accuracy; the online adaptation of teacher prompts narrows student-teacher gaps (Luo et al., 2024).
  • Prompt depth and length: Deeper prompt insertion raises robustness (inserting prompts in up to 12 transformer layers yields +3–5 percentage points robust accuracy) with minor clean accuracy cost. Optimal prompt length is ~16 tokens for efficient few-shot adaptation (Luo et al., 2024).
  • Balance hyperparameters: Robustness gains peak at $\beta \approx 0.2$ (distillation weight in APD); $\beta = 0$ disables cross-prompt distillation (Luo et al., 2024). Amplitude-phase balance in PAP is adaptively tuned via batchwise robust performance (Xu et al., 6 Feb 2025).
  • Attack budgets: Low-budget adversarial training (e.g., PGD-3 in APD) suffices to defend against strong attacks (PGD-100, AutoAttack), reflecting efficiency (Luo et al., 2024).
  • Ablation studies: In APD and text-centric BAP, removal of any alignment or perturbation module yields notable degradation in effective robustness (3–4 percentage points drop; removing both can halve clean accuracy) (Luo et al., 2024, Tsai et al., 2024).
  • LLM and backbone dependence: Choice of LLM (e.g., GPT-4o, GPT-3.5-turbo, Mixtral 8×7B) or classifier backbone induces relatively small effects (~2 percentage points) on overall robustness (Tsai et al., 2024).
  • Offensive BAP cost: Jailbreak BAP relies on gradient access for visual prompt attack and iterative LLM queries for text optimization, raising cost relative to classical prompt-based attacks (Ying et al., 2024).

7. Limitations, Open Challenges, and Defensive Countermeasures

  • White-box dependence: Black-box visual attacks for BAP remain an open problem, as current gradient-based perturbation methods require access to model internals (Ying et al., 2024).
  • Prompt-based defenses: BAP-style defenses may be circumvented by new attack vectors targeting the joint prompt space or adaptive fusion mechanisms.
  • Potential countermeasures:
    • Adversarially training LVLMs with both image and text perturbations
    • Detecting anomalous input patterns (high-frequency cues in images, early compliance tokens)
    • Input sanitization via paraphrasing and randomized cross-modal fusion structures (Ying et al., 2024)
  • Generalization: The conceptual framework and empirical successes of BAP generalize well across foundation models and downstream tasks, with substantial improvements in clean/robust tradeoffs and cross-modality resilience validated on diverse benchmarks (Luo et al., 2024, Tsai et al., 2024, Xu et al., 6 Feb 2025).
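
One of the countermeasures listed above, flagging high-frequency cues in images, can be sketched as a spectral energy ratio. The function `high_freq_ratio`, the cutoff, and the 8×8 toy images are assumptions for illustration; a practical detector would need a threshold calibrated on real data and a fast FFT.

```python
import cmath
import math
from typing import List

Image = List[List[float]]

def dft2(img: Image) -> List[List[complex]]:
    """Naive 2-D DFT, adequate only for very small images."""
    h, w = len(img), len(img[0])
    return [[sum(img[r][c] * cmath.exp(-2j * math.pi * (u * r / h + v * c / w))
                 for r in range(h) for c in range(w))
             for v in range(w)] for u in range(h)]

def high_freq_ratio(img: Image, cutoff: int) -> float:
    """Share of non-DC spectral energy at frequencies above the cutoff."""
    h, w = len(img), len(img[0])
    spec = dft2(img)
    total = high = 0.0
    for u in range(h):
        for v in range(w):
            if u == 0 and v == 0:
                continue  # skip the DC term (mean brightness)
            energy = abs(spec[u][v]) ** 2
            total += energy
            fu, fv = min(u, h - u), min(v, w - v)  # wrap to unsigned frequency
            if max(fu, fv) > cutoff:
                high += energy
    return high / total if total > 0 else 0.0

n = 8
smooth = [[math.cos(2 * math.pi * c / n) for c in range(n)] for _ in range(n)]
checker = [[(-1.0) ** (r + c) for c in range(n)] for r in range(n)]
perturbed = [[smooth[r][c] + 0.1 * checker[r][c] for c in range(n)] for r in range(n)]

ratio_smooth = high_freq_ratio(smooth, cutoff=2)
ratio_checker = high_freq_ratio(checker, cutoff=2)
ratio_perturbed = high_freq_ratio(perturbed, cutoff=2)
```

The smooth image has essentially no energy above the cutoff, while even a small checkerboard perturbation produces a measurable rise in the ratio.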

In conclusion, Bi-Modal Adversarial Prompting synthesizes advances in adversarial machine learning, prompt engineering, and cross-modal representation learning by exploiting the interplay of visual and textual signals within large pre-trained and foundation models. Its dual-use character—offensive in the hands of red-teamers, defensive for robust alignment—necessitates ongoing research into both principled training algorithms and adaptive, multi-modal threat models.
