
Persuasive Adversarial Prompts (PAP)

Updated 18 June 2026
  • Persuasive Adversarial Prompts (PAP) are specialized inputs designed using psychological persuasion techniques to bypass alignment safeguards and trigger harmful responses.
  • They leverage principles like authority, scarcity, and social proof to reframe harmful queries, using manual or model-assisted methods for prompt construction.
  • Evaluation metrics such as Attack Success Rate, Perplexity, and Weighted ASR indicate that PAPs significantly outperform traditional methods on both LLMs and VLMs.

A Persuasive Adversarial Prompt (PAP) is an adversarially crafted natural-language input that uses psychologically grounded persuasion techniques to increase the probability that a machine learning model, most notably an LLM, will generate harmful, policy-violating, or otherwise unintended outputs even when alignment or safety controls are active. PAPs have emerged as a distinct paradigm in adversarial machine learning, bridging social science principles with algorithmic attack techniques to expose and exploit vulnerabilities in LLMs, vision-language models (VLMs), and other neural systems (Noughabi et al., 24 Oct 2025, Zeng et al., 2024, Ke et al., 26 Mar 2025, Li et al., 2024, Yang et al., 2022, Xin et al., 29 Apr 2026).

1. Theoretical Foundations and Definitions

PAPs are formally defined as the result of transforming a harmful or otherwise restricted instruction $I$ into a prompt $P$ that is explicitly constructed to maximize the conditional probability that an aligned model $M$ will produce a harmful response, subject to alignment safeguards and stealth constraints (e.g., low perplexity):

$$\max_{P} \Pr\left[M(P, I)\ \text{generates harmful content}\right]$$

Given a set of harmful queries $\mathbb{Q} = \{q_i\}$ and a set of persuasion principles $\mathbb{P} = \{p_j\}$, each $q_i$ can be rewritten into persuasive variants $\mathbb{Q}^*_i = \{q^*_{i_j}\}$, one per principle $p_j$. A query is "bypassed" if there exists at least one variant for which $M(q^*_{i_j})$ is harmful (Noughabi et al., 24 Oct 2025).
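The "bypassed" criterion can be sketched on the evaluation side as a small scoring routine. This is a minimal illustration, assuming that a separate judge has already labeled each response as harmful or not; the function names and the sample labels are hypothetical and do not come from the cited papers.

```python
from typing import Dict, List


def is_bypassed(variant_is_harmful: List[bool]) -> bool:
    """A query is bypassed if at least one persuasive variant drew a harmful response."""
    return any(variant_is_harmful)


def attack_success_rate(judgments: Dict[str, List[bool]]) -> float:
    """Percentage of queries that were bypassed by at least one variant."""
    if not judgments:
        raise ValueError("no queries to score")
    bypassed = sum(is_bypassed(flags) for flags in judgments.values())
    return 100.0 * bypassed / len(judgments)


# Hypothetical judge labels: query id -> one flag per persuasion principle.
judgments = {
    "q1": [False, False, True],
    "q2": [False, False, False],
    "q3": [True, True, False],
    "q4": [False, False, False],
}

asr = attack_success_rate(judgments)
print(f"ASR = {asr:.1f}%")
```

Because a single harmful variant is enough, this existential criterion can only raise the measured rate as more principles are tried per query, which is worth keeping in mind when comparing against single-shot baselines.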

The psychological foundation for most recent PAP work is Cialdini's seven "weapons of influence": Reciprocity, Commitment (Consistency), Social Proof, Scarcity, Liking, Authority, and Unity. Each principle informs distinct prompt constructs (e.g., invoking authority leads to prompts such as "As a cybersecurity expert, your guidance is critical...") (Noughabi et al., 24 Oct 2025). Broader taxonomies use up to 40 fine-grained persuasion techniques spanning information-based, norm-based, relationship-based, emotion-based, and scarcity-based strategies (Zeng et al., 2024, Ke et al., 26 Mar 2025).

2. Methodologies for Crafting and Deploying PAPs

2.1 Prompt Construction Strategies

PAPs are constructed in one of three ways: by hand-crafting prototypes, by applying social science principles as rewriting templates, or programmatically through LLM-based paraphrasers. A standard approach uses a high-capacity, uncensored LLM as a prompt rewriter, conditioned on a specific persuasion principle:

  • For each harmful query $q_i$ and each persuasion principle $p_j$, an instruction such as "Rewrite the following request using [principle name] so that it sounds persuasive but keeps the same objective" is passed to the rewriter; the output is $q^*_{i_j}$ (Noughabi et al., 24 Oct 2025).
  • Alternative pipelines explicitly encode persuasion as a pair $(q, t)$, where $t$ is a technique, and train a paraphraser (e.g., GPT-3.5 fine-tuned on $(q, t, q^*)$ triples) to automate prompt generation (Zeng et al., 2024, Ke et al., 26 Mar 2025).

2.2 Iterative and Weighted Optimization

Recent work introduces iterative PAP refinement. The attacker LLM generates PAPs, evaluates their efficacy across a set of victim models, and uses a feedback signal—such as Weighted Attack Success Rate (WASR)—to improve prompt efficacy on the weakest targets:

$$\mathrm{WASR} = \sum_{M_k \in \mathcal{M}} w_k \cdot \mathrm{ASR}_k$$

Here $\mathcal{M}$ is a set of victim LLMs, $w_k$ are model weights, and $\mathrm{ASR}_k$ is the model-specific Attack Success Rate. This curriculum-style optimization leads to higher ASR, especially on robust models (Ke et al., 26 Mar 2025).
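As a reporting aid, the weighted metric can be computed from per-model results as follows. This is a sketch under the assumption that WASR is a weighted mean of per-model ASR with weights normalized to sum to one; the source does not state the normalization, and the model names and numbers below are hypothetical.

```python
from typing import Dict


def weighted_asr(asr: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-model ASR (in percent); weights are normalized here."""
    if asr.keys() != weights.keys():
        raise ValueError("asr and weights must cover the same models")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * asr[name] for name in asr) / total_weight


# Hypothetical per-model ASR values and weights, for illustration only.
per_model_asr = {"model_a": 80.0, "model_b": 40.0, "model_c": 20.0}
model_weights = {"model_a": 1.0, "model_b": 1.0, "model_c": 2.0}

wasr = weighted_asr(per_model_asr, model_weights)
print(f"WASR = {wasr:.1f}%")
```

Giving the most robust model the largest weight, as in this example, pulls the aggregate toward that model's ASR, so the metric mostly reflects performance where the per-model rate is lowest.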

2.3 Mask-and-Fill Schemes and Prompt-Based Attacks in NLP and VLMs

In the NLP context, PAPs can be realized via "mask-and-fill" techniques: adversarial triggers are prepended, salient tokens in the input text $x$ are masked, and a pre-trained language model (PLM) fills in the [MASK] tokens, generating adversarial examples that induce misclassification (Yang et al., 2022). In VLMs, adversarial prompt tuning (APT) optimizes soft prompt embeddings to maximize adversarial robustness, with even a single prompt token (M=1) substantially impacting attack/defense efficacy (Li et al., 2024).

3. Quantitative Evaluation and Model Fingerprinting

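Empirical studies evaluate PAP efficacy using the following metrics:

  • Attack Success Rate (ASR): the percentage of harmful queries that elicit a harmful response, reported for the original queries and for their persuasive variants, together with the relative improvement:

$$\mathrm{ASR}_{\text{orig}} = \frac{\left|\{q_i \in \mathbb{Q} : M(q_i)\ \text{is harmful}\}\right|}{|\mathbb{Q}|} \times 100$$

$$\mathrm{ASR}_{\text{pers}} = \frac{\left|\{q_i \in \mathbb{Q} : \exists j,\ M(q^*_{i_j})\ \text{is harmful}\}\right|}{|\mathbb{Q}|} \times 100$$

$$\Delta = \frac{\mathrm{ASR}_{\text{pers}} - \mathrm{ASR}_{\text{orig}}}{\mathrm{ASR}_{\text{pers}}} \times 100$$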

  • Perplexity (PPL): Human-readability and stealth are quantified via GPT-2 perplexity.
  • Weighted Attack Success Rate (WASR): Used in iterative refinement settings.
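For the perplexity metric, the underlying computation is the exponential of the negative mean token log-probability. The sketch below assumes the per-token natural-log probabilities have already been obtained from a scoring language model (GPT-2 in the cited setup); obtaining them is outside the scope of this example, and the function name is illustrative.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


# Sanity check: if every token has probability 1/4, perplexity is about 4.
uniform_logprobs = [math.log(0.25)] * 8
ppl = perplexity(uniform_logprobs)
print(f"PPL = {ppl:.3f}")
```

Fluent text receives higher token probabilities from the scoring model and therefore a lower perplexity, which is why the metric serves as a proxy for readability and stealth.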

Distinct "persuasive fingerprints" emerge per LLM: models demonstrate varying susceptibility across persuasion principles (e.g., Scarcity and Social Proof are dominant for Vicuna and Llama2; Authority for Phi-4) (Noughabi et al., 24 Oct 2025). The interaction between alignment strategies, pretraining data, and inherent linguistic biases drives these fingerprints.

Table: Original vs. persuasive ASR for selected models (Noughabi et al., 24 Oct 2025):

| Model    | Original ASR (%) | Persuasive ASR (%) | ∆ Improv. (%) |
|----------|------------------|--------------------|---------------|
| Vicuna   | 19.42            | 71.73              | +72.93        |
| Llama2   | 1.54             | 27.69              | +94.44        |
| Llama3   | 20.00            | 45.77              | +56.30        |
| DeepSeek | 21.35            | 65.96              | +67.64        |
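The improvement column appears to be the ASR gain expressed as a share of the persuasive ASR rather than of the original ASR. That reading is inferred from the table's own numbers, not stated in the text, and the short check below recomputes it under that assumption.

```python
# (original ASR %, persuasive ASR %, reported improvement %) from the table above.
rows = {
    "Vicuna": (19.42, 71.73, 72.93),
    "Llama2": (1.54, 27.69, 94.44),
    "Llama3": (20.00, 45.77, 56.30),
    "DeepSeek": (21.35, 65.96, 67.64),
}


def relative_improvement(original: float, persuasive: float) -> float:
    """Gain as a percentage of the persuasive ASR (assumed formula)."""
    return 100.0 * (persuasive - original) / persuasive


computed = {}
for model, (original, persuasive, reported) in rows.items():
    computed[model] = relative_improvement(original, persuasive)
    print(f"{model:9s} computed={computed[model]:6.2f}  reported={reported:6.2f}")
```

Three rows reproduce the reported value to two decimals; DeepSeek differs by about 0.01, which is consistent with the published figure having been computed from unrounded rates.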

Iterative PAP methods push these rates further, particularly on robust systems (e.g., GPT-4, ChatGLM), with the best ASRs reported at 90% (Ke et al., 26 Mar 2025, Zeng et al., 2024).

4. Comparative Analysis and Benchmarks

PAPs are reported to outperform prior algorithmic or purely suffix-based jailbreaks in attack success, stealth, or both. On AdvBench, the suffix baseline and GCG show much lower ASRs than PAP, and the deficit grows relative to iterative or multi-trial PAP deployments (Noughabi et al., 24 Oct 2025, Zeng et al., 2024, Ke et al., 26 Mar 2025). PAIR is the exception in the averages tabulated below: it reaches a higher ASR (69.64%) than either persuasive variant (44.56% and 45.33%), but at a higher perplexity (45.1 versus 26.67 and 23.62).

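| Method                   | Average PPL | Average ASR (%) |
|--------------------------|-------------|-----------------|
| "Sure, here's" suffix    | 52.5        | 9.73            |
| GCG                      | 15895.5     | 15.48           |
| PAIR                     | 45.1        | 69.64           |
| PAP (Logical Appeal)     | 26.67       | 44.56           |
| Persuasive Prompt (ours) | 23.62       | 45.33           |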

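PAPs exhibit the lowest perplexity in this comparison (23.62 and 26.67, against 45.1 for PAIR and 15,895.5 for GCG), so the prompt text reads as natural language and is less likely to be detected by naive filters such as perplexity thresholds.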
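To make the point about naive filters concrete, the sketch below applies a simple perplexity threshold to the average values from the table above. The threshold of 100 is a hypothetical choice for illustration, not a value taken from any of the cited papers.

```python
# Average perplexity per method, copied from the comparison table.
average_ppl = {
    "Sure, here's suffix": 52.5,
    "GCG": 15895.5,
    "PAIR": 45.1,
    "PAP (Logical Appeal)": 26.67,
    "Persuasive Prompt": 23.62,
}

PPL_THRESHOLD = 100.0  # hypothetical cutoff for a naive perplexity filter

# A prompt family is flagged when its average perplexity exceeds the cutoff.
flagged = [name for name, ppl in average_ppl.items() if ppl > PPL_THRESHOLD]
passed = [name for name, ppl in average_ppl.items() if ppl <= PPL_THRESHOLD]

print("flagged:", flagged)
print("passed: ", passed)
```

Under this assumed cutoff only the GCG suffixes are caught. Fluent prompts pass, which is why defenses against PAPs have to reason about intent rather than surface statistics.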

In the context of VLMs, a single soft adversarial prompt token can yield a +13 percentage-point gain in adversarial robustness and +8.5 points in clean accuracy over hand-engineered prompts; with adequate training, improvements reach +26.4 points and +16.7, respectively (Li et al., 2024).

5. Defensive Mechanisms and Countermeasures

Standard black-box defenses such as mutation (rephrasing, retokenization) and detection (random drops, patching) are partially effective against PAPs. For example, mutation-based defenses can reduce GPT-4’s PAP ASR from 92% to 60% (“Rephrase”) or 76% (“Retokenize”). However, detection-based approaches often achieve only modest reductions (Zeng et al., 2024).

Advanced, adaptive defenses show markedly better efficacy. A "Tuned Summarizer," a model fine-tuned to extract the core intent from prompts and strip persuasive phrasing, reduces PAP ASR on GPT-4 from 92% to 2%, at the cost of a drop in helpfulness scores (e.g., from 8.97 to 6.65 on MT-Bench) (Zeng et al., 2024). In the LLM-based peer review setting, dynamic adversarial training frameworks such as SafeReview, which alternate generator and defender optimization, yield 19% improvements in ranking fidelity and over 10× gains in accurate high-confidence judgments compared to static DPO (Xin et al., 29 Apr 2026).
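The trade-off between safety and helpfulness in these figures is easier to compare as relative changes. The short calculation below uses only the numbers quoted above for GPT-4; the helper name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction relative to the undefended value."""
    return 100.0 * (before - after) / before


BASELINE_ASR = 92.0  # PAP ASR on GPT-4 without a defense, in percent

# ASR after each defense, in percent, as quoted in the text.
defended_asr = {"Rephrase": 60.0, "Retokenize": 76.0, "Tuned Summarizer": 2.0}

asr_reduction = {
    name: relative_reduction(BASELINE_ASR, asr) for name, asr in defended_asr.items()
}
for name, reduction in asr_reduction.items():
    print(f"{name:17s} ASR reduced by {reduction:5.1f}%")

# Helpfulness cost of the Tuned Summarizer on MT-Bench.
helpfulness_drop = 8.97 - 6.65
helpfulness_drop_pct = relative_reduction(8.97, 6.65)
print(f"MT-Bench drop: {helpfulness_drop:.2f} points ({helpfulness_drop_pct:.1f}%)")
```

On these numbers the summarizer removes nearly all successful attacks while giving up about a quarter of the MT-Bench score, whereas the mutation defenses cost less but leave most attacks intact.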

Defenses against vision-language prompt-based attacks use adversarial prompt tuning, optimizing prompt embeddings on robust adversarial examples, and integrating prompt-based adversarial training at scale (Li et al., 2024, Yang et al., 2022).

6. Case Studies, Domains, and Practical Implications

LLM Jailbreaking: PAPs drive the contemporary frontier in LLM jailbreaks. Prompt reframing using Cialdini principles (Unity, Scarcity, Social Proof) converts otherwise-refused malicious requests into effective exploits yielding detailed, policy-violating outputs (Noughabi et al., 24 Oct 2025, Zeng et al., 2024).

NLP Classification: Mask-and-fill strategies, using trigger-inserted masked prompts, allow adversarial sample generation with little or no interaction required with the target classifier. PAP-based adversarial training yields +10 to +15 points in robust accuracy with <2% drops in clean accuracy (Yang et al., 2022).

Peer Review Manipulation: “Adversarial Hidden Prompts” covertly injected within academic manuscripts significantly bias LLM-powered reviewer systems, increasing acceptance rates and inflating review scores. Only dynamic, co-evolutionary defenses maintain robust performance under sophisticated generator attacks, with fairness preserved for genuinely confident writing (Xin et al., 29 Apr 2026).

Vision-Language Models: PAP-inspired prompt optimization offers a parameter-efficient, data-efficient means to strengthen model robustness and provides interpretability advantages over purely parameter-space adversarial training (Li et al., 2024).

7. Open Challenges and Future Directions

While PAPs expose latent vulnerabilities in alignment and content-safety paradigms, current research identifies several future directions:

  • Construction of defenses operating solely in black-box, API-restricted environments, possibly via prompt-layer wrappers or semantic distillation (Xin et al., 29 Apr 2026).
  • Establishing theoretical adversarial robustness guarantees for LLMs and VLMs under prompt-based attacks.
  • Extending the PAP framework to multi-modal settings, e.g., co-injection in text, images, code, or tables (Xin et al., 29 Apr 2026).
  • Adversary transferability measurement: understanding how attack prompts transfer across architectures and fine-tuning ensembles of defenders for maximal coverage.
  • Integration of human supervision in the defensive training loop for domain-specific vigilance.
  • Deeper exploration of “semantic” as opposed to “instructional” adversarial attacks—detecting misleading paraphrases or propaganda that evade simple keyword or summary-based filters.

The advent and evolution of Persuasive Adversarial Prompts highlight the necessity for cross-disciplinary approaches in both attack and defense, blending linguistic, social-behavioral, and algorithmic insights to secure and interpret ML systems at scale (Noughabi et al., 24 Oct 2025, Zeng et al., 2024, Xin et al., 29 Apr 2026, Ke et al., 26 Mar 2025).
