Bi-Modal Adversarial Prompting
- Bi-Modal Adversarial Prompting is a multi-modal technique that integrates visual and textual prompt interventions to enhance robustness, alignment, and safety in vision-language models.
- It deploys diverse methods such as adversarial prompt distillation, Fourier-based prompting, and LLM-driven text enhancements to optimize both defensive and offensive strategies.
- BAP methods demonstrate significant improvements in robustness across benchmarks while also exposing vulnerabilities in model safety and cross-modal resilience.
Bi-Modal Adversarial Prompting (BAP) is a category of multi-modal prompt engineering and adversarial manipulation strategies designed to shape, regularize, or subvert the behavior of neural models, particularly vision-language models (VLMs) and large vision-language models (LVLMs), through coordinated interventions in both the visual and textual modalities. BAP methods span defenses aimed at adversarial robustness, alignment for multi-modal learning under cross-modal corruptions, and offensive strategies (jailbreak attacks) that defeat the safety guardrails of aligned models. Core to BAP is the design or optimization of complementary prompts across modalities, exploiting the fact that contemporary multi-modal architectures process them jointly.
1. Core Principles and Taxonomy
Bi-Modal Adversarial Prompting, as instantiated in leading works, encompasses a spectrum of technical approaches, unified by the simultaneous manipulation or learning of both visual and textual prompts (or their derived representations). The scope includes:
- Defensive BAP: Adversarial Prompt Distillation (APD) (Luo et al., 2024), Phase and Amplitude-aware Prompting (PAP) (Xu et al., 6 Feb 2025), and text-centric adversarial prompting (Tsai et al., 2024) focus on enhancing adversarial robustness, clean accuracy, or cross-modal generalization by adversarially training or distilling prompts for both modalities.
- Offensive BAP: The Bi-Modal Adversarial Prompt Attack (Ying et al., 2024) targets LVLM safety by optimizing both visual and textual prompts to systematically induce harmful outputs (“jailbreaks”) in aligned models.
Key distinctions arise in the mechanism of prompt generation (gradient-based, LLM-driven, rule-based), the role of adversarial objectives (explicit loss maximization, data augmentation), and whether BAP serves robustness, safety evaluation, or alignment.
2. Bi-Modal Prompt Construction and Insertion
BAP techniques rely on parameterizing and injecting prompts into both the visual and textual branches of multi-modal models:
- Model Anatomy: In VLMs such as CLIP, separate frozen image and text encoders process their respective inputs. BAP methods introduce learnable visual prompts (e.g., prepended to patch embeddings or inserted at various transformer depths) and textual prompts (prepended to embedding sequences corresponding to class labels or natural language templates) (Luo et al., 2024).
- Fourier-based Prompting: In PAP (Xu et al., 6 Feb 2025), prompts are constructed in frequency space. For each class, additive phase-level and amplitude-level prompts are learned and injected via Discrete Fourier Transform manipulations.
- Prompt Placement and Scale: Placement depth and prompt length impact both robust and clean performance. Deeper insertion and ~16-token prompt lengths maximize robustness in APD (Luo et al., 2024); excessive prompt length underfits in low-data regimes.
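The prepending step described above can be illustrated with a minimal NumPy sketch. It is not the APD implementation: the function name `prepend_prompt`, the embedding width of 8, the sequence lengths, and the random initialization scale are all assumptions chosen for illustration; only the 16-token prompt length comes from the text.

```python
import numpy as np

def prepend_prompt(token_embeddings: np.ndarray, prompt: np.ndarray) -> np.ndarray:
    """Prepend prompt vectors to a (num_tokens, dim) embedding sequence."""
    if token_embeddings.shape[1] != prompt.shape[1]:
        raise ValueError("prompt and token embeddings must share the embedding dim")
    return np.concatenate([prompt, token_embeddings], axis=0)

rng = np.random.default_rng(0)
dim = 8                                             # toy embedding width (assumption)
patch_embeddings = rng.normal(size=(196, dim))      # e.g. a 14x14 grid of image patches
text_embeddings = rng.normal(size=(5, dim))         # e.g. tokens of a class-name template

# Two separate prompts, one per modality, each 16 tokens long.
visual_prompt = 0.02 * rng.normal(size=(16, dim))
textual_prompt = 0.02 * rng.normal(size=(16, dim))

visual_input = prepend_prompt(patch_embeddings, visual_prompt)
text_input = prepend_prompt(text_embeddings, textual_prompt)
```

In a real model the encoders stay frozen and only the two prompt arrays receive gradient updates; deep prompting would repeat the same concatenation at several transformer layers.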
A summary of prompt construction mechanisms in paradigmatic BAP methods:
| Method | Visual Prompt | Textual Prompt | Spectrum Prompting |
|---|---|---|---|
| APD (Luo et al., 2024) | Learnable | Learnable | — |
| PAP (Xu et al., 6 Feb 2025) | Phase and amplitude prompts learned in the frequency domain | None | Phase + Amplitude |
| Jailbreak BAP (Ying et al., 2024) | Adversarial | Iteratively refined via LLM | — |
| Text-centric (Tsai et al., 2024) | LLM-generated captions | LLM-enhanced summaries/CoTs | — |
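As a rough illustration of frequency-domain prompting, the sketch below adds a phase prompt and an amplitude prompt to the 2-D Fourier spectrum of an image and transforms back. It is a simplification under stated assumptions, not the PAP method itself: the function name `apply_spectrum_prompt`, the `amp_weight` parameter, and the choice to keep only the real part of the inverse transform are hypothetical.

```python
import numpy as np

def apply_spectrum_prompt(image, phase_prompt, amp_prompt, amp_weight=1.0):
    """Add phase and amplitude prompts in the Fourier domain of a 2-D image."""
    spectrum = np.fft.fft2(image)
    amplitude = np.abs(spectrum) + amp_weight * amp_prompt   # amplitude-level prompt
    phase = np.angle(spectrum) + phase_prompt                # phase-level prompt
    prompted_spectrum = amplitude * np.exp(1j * phase)
    # Arbitrary prompts break conjugate symmetry, so the inverse transform is
    # complex in general; this sketch keeps the real part only.
    return np.real(np.fft.ifft2(prompted_spectrum))

rng = np.random.default_rng(0)
image = rng.random((32, 32))                      # toy grayscale image
zero = np.zeros_like(image)

unchanged = apply_spectrum_prompt(image, zero, zero)
prompted = apply_spectrum_prompt(
    image,
    phase_prompt=0.1 * rng.normal(size=image.shape),
    amp_prompt=0.1 * rng.normal(size=image.shape),
)
```

With zero prompts the function reproduces the input, which is a convenient sanity check before any prompt is learned.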
3. Adversarial Training, Optimization, and Distillation
Advanced BAP defenses adopt a bi-level adversarial optimization:
- Inner Maximization: Generates adversarial perturbations, typically in the image domain, that maximize the loss with respect to the model's current prompts (e.g., PGD-based maximization of the cross-entropy loss) (Luo et al., 2024, Xu et al., 6 Feb 2025). In APD, only student prompts are adversarially perturbed.
- Outer Minimization: Updates prompt parameters to minimize a combination of classification and distillation losses.
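- APD employs an online teacher-student framework: the teacher is trained with a cross-entropy loss on natural (clean) data plus a KL-divergence term that aligns its clean outputs with the student's outputs on adversarial inputs, the two terms being combined through a balance weight (0.2 is reported as optimal; the weight can also be set so that distillation is disabled).
- Student supervision is via the KL divergence between the teacher's outputs on clean inputs and the student's outputs on adversarial inputs (Luo et al., 2024).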
- PAP (Xu et al., 6 Feb 2025) minimizes a weighted sum of adversarial, natural, reconstruction-similarity, and data–prompt mismatch losses.
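One plausible way to write the APD objective described above is given below. The notation is assumed here rather than taken from the cited paper: P_s and P_t denote the student and teacher prompts, f the prompted model's output distribution, x an image with label y, δ an ℓ∞-bounded perturbation of radius ε, and λ the balance weight. The norm, the direction of the KL term in the teacher loss, and the exact way λ enters are assumptions.

```latex
% Inner maximization: adversarial example crafted against the student
\delta^{\star} = \arg\max_{\|\delta\|_{\infty} \le \epsilon}
    \mathcal{L}_{\mathrm{CE}}\bigl(f_{P_s}(x + \delta),\, y\bigr)

% Outer minimization, student: distill the clean teacher into the adversarial student
\mathcal{L}_{\mathrm{student}} =
    \mathrm{KL}\bigl(f_{P_t}(x) \,\|\, f_{P_s}(x + \delta^{\star})\bigr)

% Outer minimization, teacher: clean cross-entropy plus weighted student feedback
\mathcal{L}_{\mathrm{teacher}} =
    \mathcal{L}_{\mathrm{CE}}\bigl(f_{P_t}(x),\, y\bigr)
    + \lambda\, \mathrm{KL}\bigl(f_{P_s}(x + \delta^{\star}) \,\|\, f_{P_t}(x)\bigr)
```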
Adversarial prompting in text-centric settings (Tsai et al., 2024) and offensive settings (Ying et al., 2024) typically avoids differentiable optimization for text. Instead, candidate prompts are iteratively refined using LLM-driven paraphrasing, chain-of-thoughts, and explicit corruption to produce robustifying or attack-inducing text augmentations.
4. Algorithms and Procedural Workflows
A distinctive feature of BAP frameworks is the tight integration of multi-modal prompt optimization within an adversarial training loop or, for attacks, a safety-breaking optimization loop.
Adversarial Prompt Distillation (APD, (Luo et al., 2024))—Pseudocode per Batch:
- Generate adversarial perturbations:
- For each attack step: projected-gradient update maximizing the cross-entropy loss.
- Forward clean/adversarial data through teacher and student; collect logits.
- Compute teacher (clean/distance) and student (distillation) losses.
- Update the teacher and student prompts in parallel by gradient descent, weighting the loss terms according to their balance coefficients.
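The per-batch steps above can be sketched on a toy problem. The code below covers only the inner maximization and the two loss terms, and it substitutes small linear softmax classifiers for the prompted CLIP teacher and student so that the PGD gradient can be written analytically. All names, shapes, the attack radius, the step size, and the way the 0.2 balance weight enters the teacher loss are assumptions; the prompt update itself is omitted.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def kl_divergence(p, q):
    """Mean KL(p || q) over the batch."""
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1))

def pgd_attack(weights, x, labels, eps=0.03, step=0.01, n_steps=3):
    """L-inf PGD maximizing cross-entropy of a linear softmax model."""
    onehot = np.eye(weights.shape[1])[labels]
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        probs = softmax((x + delta) @ weights)
        grad = (probs - onehot) @ weights.T          # d(CE)/d(input), per sample
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return x + delta

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))                        # toy "image" features
labels = rng.integers(0, 3, size=32)
student_w = rng.normal(size=(10, 3))                 # stand-in for the prompted student
teacher_w = rng.normal(size=(10, 3))                 # stand-in for the prompted teacher

x_adv = pgd_attack(student_w, x, labels)             # attack the student only
teacher_clean = softmax(x @ teacher_w)
student_clean = softmax(x @ student_w)
student_adv = softmax(x_adv @ student_w)

balance = 0.2                                        # weight reported as optimal
teacher_loss = cross_entropy(teacher_clean, labels) + balance * kl_divergence(student_adv, teacher_clean)
student_loss = kl_divergence(teacher_clean, student_adv)
```

In a full implementation both losses would be backpropagated to the respective prompt parameters with an autodiff framework, while the encoder weights stay frozen.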
Bi-Modal Adversarial Prompt Attack (BAP, (Ying et al., 2024))—Instructional Loop:
- Optimize image: a universal adversarial image perturbation is optimized via multi-query PGD against a positive safety-override corpus.
- Refine text: for a fixed number of iterations, use LLMs to judge success and generate improved prompts via chain-of-thought feedback.
PAP (Xu et al., 6 Feb 2025)—Fourier-based Prompting:
- For each batch: construct phase/amplitude-prompted images, apply attack methods, compute ensemble loss, and update prompts. Amplitude weighting is updated periodically based on comparative robust accuracy.
Text-centric Adversarial Prompting (Tsai et al., 2024):
- Generate LLM-generated text variants for each modality, concatenate augmented prompts, and train a downstream transformer to minimize alignment loss over both clean and adversarial variants.
5. Empirical Performance and Robustness Benchmarks
BAP methods demonstrate substantive improvements across robustness metrics in multi-modal and single-modal benchmarks:
- APD (Bi-Modal APT Distillation) (Luo et al., 2024): On 8 few-shot data sets (e.g., ImageNet, Caltech101), average clean/robust accuracy under PGD-100: APD 75.71%/47.50% (sum 123.21), outperforming textual, visual, and bimodal baselines such as APT-T, APT-V, APT-VL, and FAP-VL by 5–44 percentage points in robustness. AutoAttack: APD 42.88% versus FAP-VL 41.29%. Bimodal APD yields superior results to unimodal: e.g., APD-V (43.78% robust), APD-T (0.45% robust) (Luo et al., 2024).
- Text-Centric Adversarial Prompting (Tsai et al., 2024): On real-world, noisy, dynamic, and missing modality settings (PetFinder ACC, Airbnb MSE, Avito RMSE), Text-Adversarial prompting minimizes effective robustness drop (e.g., 90–99%) compared to alternatives such as Kosmos-2, Flamingo, PGD, and standard dropout.
- Bi-Modal Adversarial Prompt Attack (Ying et al., 2024): Average attack success rate (ASR) gains of about +29 percentage points over the best baselines (on MiniGPT-4, for example, 68.17% vs 47.76% for Liu et al., a gap of 20.41 points), strong transferability in black-box settings (ASR ~52%), and ≥32% ASR on commercial models (e.g., Gemini 41.2%, Qwen 44.0%).
- PAP (Xu et al., 6 Feb 2025): Robust accuracy on CIFAR-10 under AutoAttack: from 0.0% (no prompting) to 37.3% (PAP). On adversarially trained backbones, still yields +7–8% robust accuracy.
| Method | Clean accuracy (%) | Robust accuracy or ASR (%) | Gain | Dataset / setting |
|---|---|---|---|---|
| APD | 75.71 | 47.50 | +3–5 pp over FAP-VL | 8-dataset average (ViT-B/16) (Luo et al., 2024) |
| PAP | 87.1 | 37.3 | +37.3 pp (AutoAttack) | CIFAR-10, ResNet-18 (Xu et al., 6 Feb 2025) |
| Jailbreak BAP | — | 68.17 (ASR MiniGPT-4) | +29.03 pp over best baselines | SafetyBench + AdvBench (Ying et al., 2024) |
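Because the figures above mix accuracies, attack success rates, and percentage-point gains, a short sketch of the arithmetic may help. The helper names and the toy judgement list are hypothetical; the two MiniGPT-4 rates are the ones quoted in this section. Note that the single-model gap computed here is a different quantity from the +29.03 pp figure in the table, which is reported as an average.

```python
def attack_success_rate(judgements):
    """Percentage of attack attempts that a judge marked as successful."""
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(bool(j) for j in judgements) / len(judgements)

def gain_in_points(method_rate, baseline_rate):
    """Absolute difference between two rates, in percentage points."""
    return method_rate - baseline_rate

toy_judgements = [True, False, True, True]        # hypothetical judge outputs
toy_asr = attack_success_rate(toy_judgements)     # 3 of 4 attempts succeed

# MiniGPT-4 figures quoted above: BAP 68.17% vs best baseline 47.76%.
minigpt4_gap = gain_in_points(68.17, 47.76)
```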
6. Insights, Ablation Diagnoses, and Practical Considerations
- Teacher-student online distillation: Non-robust (clean) teacher models provide soft semantic labels that, when distilled into the student under adversarial training, enhance both robustness and clean accuracy; the online adaptation of teacher prompts narrows student-teacher gaps (Luo et al., 2024).
- Prompt depth and length: Deeper prompt insertion raises robustness (inserting prompts in up to 12 transformer layers adds 3–5 percentage points of robust accuracy) at a minor cost in clean accuracy. The optimal prompt length is ~16 tokens for efficient few-shot adaptation (Luo et al., 2024).
- Balance hyperparameters: Robustness gains peak at a distillation weight of 0.2 in APD; the weight can also be set so that cross-prompt distillation is disabled (Luo et al., 2024). Amplitude-phase balance in PAP is adaptively tuned via batchwise robust performance (Xu et al., 6 Feb 2025).
- Attack budgets: Low-budget adversarial training (e.g., PGD-3 in APD) suffices to defend against strong attacks (PGD-100, AutoAttack), reflecting efficiency (Luo et al., 2024).
- Ablation studies: In APD and text-centric BAP, removal of any alignment or perturbation module yields notable degradation in effective robustness (3–4 percentage points drop; removing both can halve clean accuracy) (Luo et al., 2024, Tsai et al., 2024).
- LLM and backbone dependence: Choice of LLM (e.g., GPT-4o, GPT-3.5-turbo, Mixtral8×7B) or classifier backbone induces relatively small effects (~2 percentage points) on overall robustness (Tsai et al., 2024).
- Offensive BAP cost: Jailbreak BAP relies on gradient access for visual prompt attack and iterative LLM queries for text optimization, raising cost relative to classical prompt-based attacks (Ying et al., 2024).
7. Limitations, Open Challenges, and Defensive Countermeasures
- Threat-model gaps: Black-box visual attacks for BAP remain an open problem, as current gradient-based perturbation methods require access to model internals (Ying et al., 2024).
- Prompt-based defenses: BAP-style defenses may be circumvented by new attack vectors targeting the joint prompt space or adaptive fusion mechanisms.
- Potential countermeasures:
- Adversarially training LVLMs with both image and text perturbations
- Detecting anomalous input patterns (high-frequency cues in images, early compliance tokens)
- Input sanitization via paraphrasing and randomized cross-modal fusion structures (Ying et al., 2024)
- Generalization: The conceptual framework and empirical successes of BAP generalize well across foundation models and downstream tasks, with substantial improvements in clean/robust tradeoffs and cross-modality resilience validated on diverse benchmarks (Luo et al., 2024, Tsai et al., 2024, Xu et al., 6 Feb 2025).
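One of the countermeasures listed above, detecting high-frequency cues in images, can be approximated with a simple spectral statistic. The sketch below is a hypothetical heuristic, not a method from the cited work: the function name, the 0.25 cycles-per-pixel cutoff, and the synthetic test images are assumptions, and a practical detector would need a threshold calibrated on real clean and adversarial images.

```python
import numpy as np

def high_frequency_ratio(image, cutoff=0.25):
    """Fraction of spectral energy above `cutoff` cycles/pixel (Nyquist is 0.5)."""
    power = np.abs(np.fft.fft2(image)) ** 2
    freq_y = np.fft.fftfreq(image.shape[0])[:, None]
    freq_x = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.sqrt(freq_x ** 2 + freq_y ** 2)      # radial frequency per bin
    total = power.sum()
    return float(power[radius > cutoff].sum() / total) if total > 0 else 0.0

rng = np.random.default_rng(0)
rows, cols = np.mgrid[0:64, 0:64]

# A smooth synthetic image whose energy sits at very low frequencies.
smooth = np.sin(2 * np.pi * rows / 64) + np.cos(2 * np.pi * cols / 64)
# The same image with pixel-level sign noise, a crude stand-in for a perturbation.
noisy = smooth + 0.5 * rng.choice([-1.0, 1.0], size=smooth.shape)

smooth_score = high_frequency_ratio(smooth)
noisy_score = high_frequency_ratio(noisy)
```

The noisy image scores markedly higher than the smooth one, which is the signal such a detector would threshold on; small-norm perturbations on natural images produce a much weaker contrast than this toy case.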
In conclusion, Bi-Modal Adversarial Prompting synthesizes advances in adversarial machine learning, prompt engineering, and cross-modal representation learning by exploiting the interplay of visual and textual signals within large pre-trained and foundation models. Its dual-use character—offensive in the hands of red-teamers, defensive for robust alignment—necessitates ongoing research into both principled training algorithms and adaptive, multi-modal threat models.