Persona-Guided Adversarial Instruction Generation
- The paper demonstrates a novel persona-guided framework that achieves an 81.83% attack success rate on GPT-4o, outperforming traditional seed-based methods.
- It details a two-phase process with initial persona-driven instruction synthesis followed by iterative refinement using a dedicated verifier scoring system.
- Empirical results across eight LLMs validate enhanced semantic diversity, robust cross-model generalization, and improved adversarial prompt effectiveness.
Persona-guided adversarial instruction generation is a technique for synthesizing semantically diverse, malicious prompts for LLMs without reliance on seed instructions. In this paradigm, the attack prompt is steered by prepending a contextual "persona" profile, which guides the generative model to craft harmful requests from the perspective of a specified role. AutoRed is the primary framework operationalizing this methodology, with dedicated components for persona-driven instruction synthesis, high-throughput adversarial prompt filtering, and iterative refinement. This approach yields substantially higher attack success rates and broader generalization than previous seed-based red teaming schemes (Diao et al., 9 Oct 2025).
1. Framework Architecture and Operational Stages
AutoRed organizes adversarial instruction creation into two main stages. Stage 1 centers on persona-guided adversarial instruction generation: small batches of free-form prompts are crafted by conditioning an attack model on diverse persona profiles sampled from a persona bank. Each profile (e.g., "You are an experienced pharmaceutical researcher specializing in antiviral compounds…") acts as a contextual anchor, and the prompt issued to the attack model combines the sampled persona with the instruction-generation task. Stage 1 outputs a pool of candidate instructions, which are simultaneously used to attack a set of small target LLMs; the resulting (instruction, score) pairs are collected to train the verifier. Stage 2 comprises a reflection and refinement loop: low-quality candidates (those scoring below the verifier's threshold) are re-synthesized for improved subtlety and harmfulness, while high-scoring samples are retained for the final datasets (AutoRed-Medium and AutoRed-Hard, partitioned by thresholds on the verifier score). This loop accelerates the selection and refinement of effective adversarial instructions.
2. Persona Conditioning Mechanisms
The persona bank is a reservoir of short, trait-rich textual profiles, each designed to induce the generative model to adopt a specific role or area of expertise. Conditioning uses direct text concatenation of the selected persona and the explicit instruction-generation task. An embedding-level interpretation is optional: if $E$ denotes the LLM encoder, $p$ the persona text, and $t$ the task text, then the composite input embedding can be written as
$$E(p \oplus t) \in \mathbb{R}^{L \times d},$$
where $\oplus$ denotes text concatenation, $L$ is the token length, and $d$ the embedding dimension. No explicit vector-space supervision is imposed; all prompting follows standard LLM input protocols.
3. Batch Persona-Guided Instruction Synthesis
Instruction generation is implemented as a combinatorial, persona-centric batch process. The attack model receives a batch of distinct personas and, for each persona, synthesizes a fixed number of candidate instructions. The process iterates over the persona bank batch by batch, appending each persona's outputs to the candidate pool. The number of personas and the batch sizes are chosen so that one full pass yields the target pool size.
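The batching logic described here can be sketched as follows. This is a minimal illustration under stated assumptions, not AutoRed's implementation: `batched`, `synthesize_pool`, and the `generate` callable are hypothetical names, and the attack model is represented only by a caller-supplied function so the sketch stays model-agnostic.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def synthesize_pool(
    personas: Sequence[str],
    generate: Callable[[str, int], List[str]],  # hypothetical attack-model wrapper
    batch_size: int,
    per_persona: int,
) -> List[str]:
    """Collect up to `per_persona` candidates for every persona, batch by batch."""
    pool: List[str] = []
    for batch in batched(personas, batch_size):
        for persona in batch:
            candidates = generate(persona, per_persona)
            # Keep at most `per_persona` outputs in case the model over-generates.
            pool.extend(candidates[:per_persona])
    return pool
```

With N personas and k candidates per persona, one full pass yields at most N × k candidates, which is how the pool size follows from the chosen parameters.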
4. Reflection and Iterative Prompt Refinement
Post-generation, AutoRed employs a scalable reflection loop for iterative prompt improvement and selection. Candidates are partitioned by verifier score into hard, medium, and low bands, with thresholds on the score delineating dataset cuts and refinement eligibility. In each round, candidates in the low band are rewritten and rescored, while those that reach the medium or hard band are moved into the corresponding dataset. This design enables recycling and enhancement of borderline instructions, raising attack quality and semantic subtlety.
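A minimal sketch of the partition-and-refine control flow follows. The threshold values, the `score` and `refine` callables, and the function names are all assumptions introduced for illustration; the source does not specify them here, and the sketch only shows the loop structure.

```python
from typing import Callable, Dict, List


def partition(scores: Dict[str, float], hard: float, medium: float) -> Dict[str, List[str]]:
    """Split candidates into hard / medium / low bands by verifier score."""
    if medium > hard:
        raise ValueError("medium threshold must not exceed hard threshold")
    bands: Dict[str, List[str]] = {"hard": [], "medium": [], "low": []}
    for text, value in scores.items():
        if value >= hard:
            bands["hard"].append(text)
        elif value >= medium:
            bands["medium"].append(text)
        else:
            bands["low"].append(text)
    return bands


def reflect(
    candidates: List[str],
    score: Callable[[str], float],  # hypothetical verifier wrapper
    refine: Callable[[str], str],   # hypothetical rewrite step
    hard: float,
    medium: float,
    rounds: int,
) -> Dict[str, List[str]]:
    """Score candidates, keep hard/medium ones, and rewrite low ones up to `rounds` times."""
    kept: Dict[str, List[str]] = {"hard": [], "medium": [], "low": []}
    pending = list(candidates)
    for round_index in range(rounds + 1):
        # Duplicate candidates collapse into one entry because they share a dict key.
        bands = partition({c: score(c) for c in pending}, hard, medium)
        kept["hard"].extend(bands["hard"])
        kept["medium"].extend(bands["medium"])
        pending = bands["low"]
        if not pending or round_index == rounds:
            break
        pending = [refine(c) for c in pending]
    kept["low"] = pending  # scored below the medium threshold after the final round
    return kept
```

Scoring happens once more than refinement, so every candidate left in the low band has been scored in its final form.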
5. Verifier Architecture and Scoring Metrics
Instruction harmfulness assessment is offloaded to a dedicated verifier, eschewing repeated queries to target LLMs. Initial training data comprises instructions numbering in the thousands, each attacked against a set of six small models; results are judged by Llama-Guard2. The verifier score for an instruction is derived from these per-model judgments. Labeled (instruction, score) pairs feed multi-class classification (base model: Llama3-8B-Instruct; loss: cross-entropy; epochs: 3). For new instructions, the verifier predicts a score and partitions instructions for dataset assignment: AutoRed-Hard (highest score band), AutoRed-Medium (next band), with the remainder returned for further refinement.
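To make the labeling step concrete, the sketch below assumes one possible scoring rule: the label is the number of target models whose response was judged unsafe, which with six models gives classes 0 through 6. This rule is an assumption for illustration only, and `score_label` and `build_training_pairs` are hypothetical names; the source's exact score definition may differ.

```python
from typing import Dict, List, Tuple


def score_label(judgments: Dict[str, bool]) -> int:
    """Count target models whose response was judged unsafe (assumed scoring rule)."""
    return sum(1 for unsafe in judgments.values() if unsafe)


def build_training_pairs(
    results: Dict[str, Dict[str, bool]],
) -> List[Tuple[str, int]]:
    """Turn per-instruction, per-model judgments into (instruction, label) pairs."""
    return [(instruction, score_label(j)) for instruction, j in results.items()]
```

Pairs of this shape are what a multi-class classifier would consume, with the integer label as the target class.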
6. Empirical Effectiveness of Persona Guidance
AutoRed's persona-guided methodology delivers marked improvements in attack effectiveness and generalization. On GPT-4o:
- AutoRed-Hard yields 81.83% attack success rate (ASR).
- AutoRed-Medium achieves 77.67% ASR.
- Seed-based baselines, e.g., CodeChameleon, reach only 11.46%; AutoDAN ≈30%.
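For reference, attack success rate is conventionally the share of attempts whose response a safety judge flags as unsafe. The helper below is a small sketch of that computation; the function name and the boolean-flag input format are assumptions, and the paper's exact judging protocol is not reproduced.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Return the percentage of attempts whose response was judged unsafe."""
    if not unsafe_flags:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(unsafe_flags) / len(unsafe_flags)
```

For example, three flagged responses out of four attempts give an ASR of 75%.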
Cross-model generalization is robust: adversarial prompts generated via small open-source models transfer successfully to the eight evaluated LLMs (including Claude 3.5, Llama3-70B, and Qwen2.5-72B), maintaining ASRs above 60% for AutoRed-Hard.
Persona ablation demonstrates a ten-fold increase in viable high-score instructions: persona-guided synthesis obtains 42, versus only 3 for "direct generation" without a persona; low-score outputs are also reduced (76,507 vs. 88,564).
Iterative reflection further amplifies performance: three rounds of refinement sustain 3 unique hard instructions per iteration, and ASR for GPT-4o rises from 6% (original) to 61.5% (refined).
7. Methodological Implications and Comparative Advances
Persona-guided adversarial instruction generation represents a substantive advance in LLM red teaming by enabling free-form synthesis and semantic diversity in attack prompts, decoupled from seed-based paradigms. The synergy of persona-driven contextual conditioning, scalable automated verification, and iterative reflection achieves higher efficiency and broader coverage, thus offering a rigorous procedure for probing and quantifying model vulnerabilities. The empirical results indicate clear limitations of seed-based schemes and validate the potential of free-form red teaming for comprehensive AI safety evaluation (Diao et al., 9 Oct 2025). A plausible implication is that future adversarial testing will further adopt persona-centric conditioning as a cornerstone for robustness audits and policy compliance assessments.