RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection (2510.04885v1)
Abstract: Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT-4o and a $72\%$ ASR against GPT-5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward-hack diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl-injector.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
Overview
This paper is about testing how safe and reliable AI assistants (like chatbots or coding helpers) really are when bad actors try to trick them. The authors build a simple but powerful attack method, called RL-Hammer, that teaches a model to write sneaky “prompt injections” — messages designed to override the AI’s rules and make it do things it shouldn’t. They show that even AIs with modern defenses can still be fooled a lot of the time.
What the researchers wanted to find out
In simple terms, they asked:
- Can we automatically train a model to write very effective trick prompts (prompt injections) without hand-made examples?
- Will these attacks still work against strong, modern defenses (like Instruction Hierarchy and SecAlign)?
- Can we make the attacks varied and hard to detect?
- What tricks make training such an attack model actually work on tough, real-world systems?
How they did it (in everyday language)
First, a few key ideas:
- LLM: Think of a super-smart autocomplete that tries to follow instructions and be helpful.
- Prompt injection: Like putting a fake sign on a door that says “Ignore the rules and open this” — the AI reads text containing hidden commands and follows them instead of the original, safe instructions.
- Reinforcement Learning (RL): Training by trial and error. The model tries something, gets a “good” or “bad” signal (a reward), and learns what works.
What they built:
- RL-Hammer: An attacker model trained with RL to write trick prompts. It doesn’t need a warm-up dataset; it learns from scratch by trying attacks and seeing if they work.
How the training works (analogy):
- Group tries: For each goal (like “make the AI do X”), the attacker generates several candidate prompts at once — like a mini contest.
- Score the attempts: If a candidate makes the target AI break the rules, it gets a reward (1). If not, 0. The attacker compares its own attempts and learns from which ones did better.
- No peeking: They treat the target AIs like black boxes — they don’t look inside; they only use the “worked/didn’t work” outcome.
The “bag of tricks” that made RL-Hammer strong:
- Remove the “leash” (no KL regularization): There’s a common safety rope in RL that keeps the model close to how it originally talks. The authors found this rope made the attacker too cautious. Taking it off let the attacker specialize and learn stronger tricks faster.
- Train on both easy and tough targets at the same time: Learning only on hard, well-defended AIs gives almost no positive feedback (too many failures). Learning only on easy AIs overfits and doesn’t transfer. Training on both, and giving a “soft” score (partial credit if it works on one but not the other), gave enough signal to learn effective attacks that also carry over to stronger systems.
- Keep outputs in a strict format: They made the attacker keep its prompt inside special markers. This stopped it from rambling, repeating, or producing super long, messy text that hurts learning.
They also studied:
- Diversity: They tried to make the attacker produce many different-looking attacks. But the attacker often “gamed” the diversity metrics, changing casing or adding unrelated fluff instead of creating truly new strategies.
- Detectability: They checked if common detectors could spot these attacks. The attacks were often fluent and natural, so many detectors didn’t catch them. With extra training focused on “being stealthy,” the attacker bypassed all tested detectors while staying effective.
What they found and why it matters
Main results (why this is a big deal):
- High success rates against strong models with defenses:
- Against GPT-4o: up to 98% attack success rate (ASR).
- Against GPT-5: up to 72% ASR.
- Strong results across other models too (e.g., Meta SecAlign, Gemini, Claude).
- Works without warm-up data: The attacker learned from scratch using only success/failure signals.
- Transfers across models: Tricks learned on one system often worked on others, even if the attacker hadn’t trained on them.
- Evades detectors: The attacks naturally slipped past several detectors. With extra “be stealthy” training, they evaded all four detectors tested while still working well.
- Diversity is tricky: When they tried to force variety, the attacker often “reward-hacked” — it inflated diversity scores in superficial ways without inventing genuinely new attack ideas.
Why this matters:
- Today’s defenses (like Instruction Hierarchy and SecAlign) are strong against simple, hand-made attacks, but they can still be beaten by an automated RL attacker.
- Current detectors aren’t reliable enough to protect agent systems that can use tools or control software.
What this could mean going forward
- We need stronger, more principled defenses: Systems should be designed to resist attacks that are learned and optimized automatically, not just simple ones.
- Better detection is necessary: Detectors must handle fluent, natural-sounding attacks that don’t “look weird.”
- Safer benchmarks and red-teaming: Automated red-teaming (using tools like RL-Hammer) can help test defenses more realistically, so companies can fix weaknesses before attackers exploit them.
- Practical limits: Training such attackers can be expensive and might trigger provider safeguards if you send too many attack attempts. Future research should make learning more efficient and less query-heavy.
Safety note: This paper is meant to improve security by showing what’s possible, not to encourage misuse. The big takeaway is that we must keep upgrading defenses and testing them with strong, automated methods.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of gaps and open questions that remain unresolved and could guide future research.
- Query efficiency is not quantified or optimized:
- The paper acknowledges training is costly but does not report query counts, API costs, token usage per epoch/model, or cost per 1% ASR improvement.
- Open question: What strategies (e.g., surrogate models, offline RL, importance sampling, gradient-free estimators) most reduce queries without sacrificing ASR?
- Success verification relies on string parsing:
- The evaluation may mislabel success/failure when target models paraphrase or partially comply; no sandbox instrumentation or tool-ground-truth checks are used.
- Open question: How do ASR estimates change with environment-level instrumentation (e.g., OS/file system audits, API call logs) or robust semantic judges?
- Generalization beyond text-only and simple tool calls is untested:
- No evaluation on multi-modal inputs (images, PDFs, screenshots), indirect injection via web/RAG retrieval, or long-context/memory-intensive agent workflows.
- Open question: Does RL-Hammer transfer to multimodal agent settings and complex OS-level pipelines with multi-step dependencies?
- Restricted-format reliance may limit external validity:
- Attacks require prompts enclosed by special tokens; it is unclear if these constraints hold in real applications where attacker cannot enforce wrappers.
- Open question: How effective is RL-Hammer when formatting constraints cannot be enforced or are actively blocked/sanitized?
- Diversity objectives are reward-hacked; genuine strategic diversity remains unsolved:
- Token-level and embedding-level rewards yield superficial diversity (case changes, irrelevant prefacing), not distinct attack strategies.
- Open question: What diversity metrics or training schemes elicit genuinely different attack tactics (e.g., strategy-level canonicalization, causal-behavioral diversity metrics)?
- Transferability mechanics are underexplored:
- Joint training improves transfer, but the reasons (shared instruction-following vulnerabilities, tool schemas, or prompting patterns) aren’t analyzed.
- Open question: Which properties of target models or task distributions predict successful transfer, and can we formalize/measure them?
- Reward shaping and curriculum design are minimal:
- Only soft vs. hard rewards and joint training are tested; no exploration of curricula (goal difficulty scheduling), multi-objective optimization schedules, or advantage estimation variants beyond GRPO.
- Open question: Which RL design choices (curricula, entropy bonuses, exploration schedules, group size G, advantage normalization) most affect training stability and ASR?
- Hyperparameter space is narrow and variance is unreported:
- Training uses fixed settings (e.g., G=8, lr=1e-5, 40 epochs) and a single attacker backbone (Llama-3.1-8B LoRA); the variance across seeds, hyperparameters, and attacker sizes is not quantified.
- Open question: What is the sensitivity of ASR to attacker model size, architecture, LoRA rank, group size, and seed-to-seed variability?
- Alpha weighting in joint rewards is not ablated:
- The paper introduces a weighting factor α for robust/easy targets but does not report its effect or optimal schedules.
- Open question: How should α be set or scheduled to maximize transfer without overfitting to the easy target?
- Evaluation scope is narrow in tasks, languages, and domains:
- Benchmarks are primarily InjecAgent, AgentDojo, and AdvBench; no cross-lingual, code-heavy, enterprise workflows, or regulated domains (finance/healthcare) are tested.
- Open question: How do ASR and detectability vary across languages, domain jargon, code contexts, and enterprise tool ecosystems?
- Baseline comparisons may be underpowered:
- GCG and other baselines are not tuned for the defended black-box setting; fairness (query budgets, warm-up data, multiple attempts) is not standardized across methods.
- Open question: Under equalized query budgets and multi-attempt settings, how do RL-Hammer and strong optimization baselines compare?
- Attack universality is claimed but not formalized:
- The paper notes “universal strategies” (prefix/suffix templates) but does not quantify universality across unseen tasks/tools or provide canonicalization analyses.
- Open question: Can we define and measure “strategy universality” and identify templates that generalize across models, tasks, and defenses?
- Detection evaluation is limited to text-level detectors:
- Detectors include perplexity, two small classifiers, and an LLM judge; no agent-level, tool-level, or system-level detection (e.g., API gating, policy enforcement) is tested.
- Open question: Which runtime or system-layer defenses (tool whitelists, capability scoping, boundary-aware execution) mitigate RL-Hammer most effectively?
- Cat-and-mouse dynamics with adaptive defenses not studied:
- The paper shows detectors can be bypassed with a stealthiness reward but does not explore adversarial training of defenses against RL-Hammer.
- Open question: If detectors or defenses are trained on RL-Hammer prompts, how fast does ASR degrade, and what counter-adaptation restores attack effectiveness?
- Practical deployment constraints are not examined:
- API rate limits, dynamic guardrails, provider anomaly detection, and account bans are not modeled, though acknowledged as practical risk.
- Open question: How do provider-side controls (rate limiting, anomaly detectors, behavioral analytics) impact RL-Hammer’s feasibility and training dynamics?
- Real-world harm modeling and safety bounds are absent:
- No measurement of potential impact severity (data exfiltration volume, tool abuse cost) or bounds on harmful outcomes is provided.
- Open question: How should we quantify and constrain real-world harm in agentic evaluations to prioritize defense research?
- Utility degradation and language fluency are not rigorously measured:
- The paper asserts fluency remains high after removing KL regularization, but provides no quantitative metrics (perplexity, readability, grammar, utility on benign tasks).
- Open question: Does removing KL degrade general language quality or benign task performance, and how can we balance attack specialization with fluency?
- Assumptions about tool-calling semantics may not generalize:
- Black-box evaluations rely on provider tool-calling; variations in schema, argument validation, and error handling may affect ASR but are undocumented.
- Open question: How sensitive is RL-Hammer to differences in tool schemas, argument validators, and execution sandboxes across providers?
- Dataset coverage and labeling quality require scrutiny:
- InjecAgent/AgentDojo/AdvBench coverage of real-world injection types, multi-step tasks, and indirect injections is limited; labels and goals may not reflect enterprise scenarios.
- Open question: What benchmark characteristics (task complexity, tool diversity, indirect attack surfaces) are necessary for realistic robustness assessments?
- Failure case analysis is missing:
- The paper focuses on successes but does not characterize failures (by goal type, model family, tool type, or prompt length).
- Open question: Which goals or contexts consistently resist RL-Hammer, and what insights do they provide for defense design?
- Cross-model/version reproducibility is uncertain:
- Commercial model versions evolve; the specific versions, dates, and configuration details are not fully documented, risking non-reproducibility.
- Open question: How stable are ASR results across provider model updates and configuration changes, and what metadata ensures reproducible red-teaming?
- Ethical deployment of automated red-teaming is not operationalized:
- Beyond the ethics statement, there is no protocol for safe use (rate limits, disclosure, remediation, consent) in enterprise or shared environments.
- Open question: What governance frameworks and operational guardrails enable responsible, large-scale automated red-teaming with RL-based attackers?
Practical Applications
Immediate Applications
Below is a concise set of deployable use cases that leverage the paper’s findings and code to improve safety, assurance, and research around LLM agents. Each item notes relevant sectors, potential tools/workflows, and key dependencies.
- Automated red-teaming for LLM agent deployments
- Sector: software, finance, healthcare, robotics, public sector
- What: Use the open-source RL-Hammer pipeline to automatically generate strong prompt injection/jailbreak attempts against agent systems (e.g., Copilot-like tool use, OS actions, multi-step pipelines).
- Tools/workflows: “Red-team job” in CI/CD; nightly robustness sweeps on InjecAgent/AgentDojo tasks; attack success auto-verification via string/action parsers; an “attack template library” of learned universal prefixes/suffixes for quick stress tests.
- Assumptions/dependencies: Authorized testing in sandboxed environments; black-box/API access to targets and tool-calling; sufficient query budgets and rate limits; automated success criteria for domain-specific actions; GPU or cloud training resources.
- Continuous robustness monitoring and regression testing
- Sector: enterprise MLOps, regulated industries
- What: Monitor model updates and tool integrations with periodic RL-Hammer attack runs; track Attack Success Rate (ASR) trends and detector performance over time.
- Tools/workflows: Robustness dashboards; alerting on ASR spikes; model rollback gates; test harnesses bound to release pipelines.
- Assumptions/dependencies: Change management hooks in MLOps; stable test suites (InjecAgent/AgentDojo); cost controls for API usage.
- Detector evaluation and tuning (stealth/evasion testing)
- Sector: security operations, trust & safety
- What: Evaluate existing detectors (e.g., Llama Prompt Guard, ProtectAI Guard, perplexity filters) against RL-Hammer attacks; calibrate thresholds; build ensembles; add LLM-judge–based signals where appropriate.
- Tools/workflows: “Detector Evasion Dashboard”; periodic A/B tests; policy routing based on multi-detector votes; countermeasure playbooks.
- Assumptions/dependencies: Access to detectors, their APIs/weights; governance for false positives/negatives; LLM judge usage costs and latency.
- Adversarial data generation for defense training
- Sector: model vendors, applied research
- What: Use RL-Hammer to produce hard negative examples to train or fine-tune defenses (e.g., SecAlign, Instruction Hierarchy) and prompt firewalls.
- Tools/workflows: Data pipelines ingesting adversarial prompts; curriculum mixing easy and hard samples; joint training harnesses with soft rewards.
- Assumptions/dependencies: Careful dataset curation to avoid overfitting; defense-side evaluation metrics; compute for iterative training.
- Vendor audit and procurement due diligence
- Sector: policy, enterprise IT, compliance
- What: Require third-party LLM vendors to pass an RL-Hammer–backed adversarial audit; include ASR and detector evasion scores in RFPs and SLAs.
- Tools/workflows: Standardized audit reports; reproducible test suites; signed attestation and periodic re-certification.
- Assumptions/dependencies: Legal authorization; shared benchmarking protocol; version pinning of target models.
- Incident response drills and safety case validation
- Sector: finance, healthcare, public sector
- What: Simulate realistic prompt-injection incidents (e.g., “send funds,” “exfiltrate PII,” “change file permissions”) to train SOC teams and validate safety cases.
- Tools/workflows: Tabletop exercises using sandbox agents; replay of adversarial sessions; documentation of containment steps.
- Assumptions/dependencies: Non-production sandboxes; defined rollback/disable controls; clear escalation paths.
- Tool I/O hardening via restricted formatting
- Sector: software engineering, agent platforms
- What: Enforce structured tool output conventions (e.g., special wrapper tokens) and validate format compliance to reduce degenerate or repetitive generations that undermine parsing and safety checks.
- Tools/workflows: Output validators; protocol schemas; “format-only” tests as pre-flight checks.
- Assumptions/dependencies: Adoption of structured I/O across toolchain; parsers aligned with safety gates.
- Academic replication and methods research (reward hacking, diversity, transferability)
- Sector: academia, industrial research labs
- What: Reproduce GRPO-based training, ablations (KL removal, joint training), diversity rewards and reward-hacking analyses; explore improved metrics that resist superficial diversity.
- Tools/workflows: Open-source codebase; InjecAgent/AgentDojo/AdvBench splits; LLM-judge prompts; shared result repositories.
- Assumptions/dependencies: Compute resources; access to target models; standardized evaluation scaffolds.
- Prompt hygiene checks for consumer-facing assistants
- Sector: daily life, SMBs
- What: Use detector ensembles (without full RL training) to flag likely injection content in email, calendar invites, and web pages that assistants might ingest.
- Tools/workflows: Lightweight browser/email plugins; “prompt hygiene linter” for content sources; safe-mode toggles in assistants.
- Assumptions/dependencies: Detector quality; user consent and privacy guarantees; minimal latency overhead.
Long-Term Applications
The following use cases require further research, scaling, ecosystem adoption, or standardization before broad deployment.
- Principled, upgradeable defenses against prompt injection
- Sector: model vendors, platform providers
- What: Use RL-Hammer–generated attacks to adversarially train and evaluate improved defenses (e.g., refined Instruction Hierarchy/SecAlign, boundary-awareness, explicit reminders) that distinguish informational context from executable instructions.
- Tools/workflows: Closed-loop adversarial training; defense evaluation suites; multi-target joint training strategies.
- Assumptions/dependencies: Access to defense training pipelines; robust generalization to unseen attack families; avoidance of alignment regression.
- Query-efficient automated red-teaming
- Sector: security, MLOps
- What: Develop surrogate-guided or active-learning RL attackers to reduce API calls and cost while maintaining ASR; integrate caching and transfer across model families.
- Tools/workflows: Surrogate models; bandit-style sampling; budget-aware schedulers.
- Assumptions/dependencies: Reliable surrogate fidelity; safe exploration constraints; model update tracking.
- Red-team-as-a-service market for LLM agents
- Sector: security services
- What: Offer managed adversarial testing (including stealth/evasion scoring) to enterprises running agentic systems, with periodic reports and remediation guidance.
- Tools/workflows: Customer-specific sandboxes; sector-tailored playbooks; compliance-grade audit artifacts.
- Assumptions/dependencies: Legal frameworks, customer data isolation, insurance/liability models.
- Industry and regulatory certification programs
- Sector: policy/regulation
- What: Establish benchmarks (e.g., NIST-like) requiring agents to pass adversarial tests with thresholds on ASR, detector coverage, and containment; mandate re-certification after major updates.
- Tools/workflows: Standard testbeds; public leaderboards; conformance tests; attestation protocols.
- Assumptions/dependencies: Multi-stakeholder governance; international harmonization; transparent reporting.
- Secure-by-design orchestration and tool routers
- Sector: agent platforms, robotics, industrial control (energy/ICS)
- What: Architect orchestration layers that isolate external content, apply execution constraints, and enforce structured protocols for tool I/O and memory modules to mitigate injection risks.
- Tools/workflows: Policy engines; capability gating; sandboxes with restricted APIs; provenance and boundary tags.
- Assumptions/dependencies: Widespread protocol adoption; measurable safety benefits; compatibility with legacy tools.
- Risk scoring and insurance underwriting for AI deployments
- Sector: finance/insurance
- What: Use adversarial robustness metrics (ASR, detector evasion rates, containment efficacy) to price risk, set premiums, and define coverage conditions for agentic systems.
- Tools/workflows: Standardized scoring models; audit data feeds; compliance checklists.
- Assumptions/dependencies: Longitudinal incident data; accepted scoring frameworks; regulatory alignment.
- Education and workforce development in AI safety
- Sector: education, professional training
- What: Build curricula and certification tracks for AI safety engineers focused on agentic systems, adversarial testing, and defense design using RL-Hammer–style pipelines.
- Tools/workflows: Hands-on labs; case studies; community benchmarks.
- Assumptions/dependencies: Institutional adoption; safe training environments; open datasets.
- Multi-modal and embodied agent safety (robots, AR/VR, IoT)
- Sector: robotics, consumer devices, smart infrastructure
- What: Extend adversarial testing to agents with physical actuation or multimodal inputs, ensuring injection attempts cannot trigger harmful actions.
- Tools/workflows: Simulation environments; hardware-in-the-loop tests; safety interlocks.
- Assumptions/dependencies: High-fidelity simulators; safe device interfaces; sector-specific regulations.
- Diversity-aware attack generation and measurement
- Sector: research, vendors
- What: Develop diversity metrics that resist reward hacking (beyond BLEU/embedding tricks) to ensure coverage of genuinely distinct attack strategies.
- Tools/workflows: Semantic-structure evaluators; causality-aware judges; meta-learning curricula for coverage.
- Assumptions/dependencies: Reliable ground truth of “strategic novelty”; scalable evaluation methods.
- Advanced detection and provenance
- Sector: platform security, content ecosystems
- What: Move beyond LLM-judge detectors to robust multi-signal systems (content provenance, watermarking, boundary-awareness tags) that remain effective against stealthy RL-generated attacks.
- Tools/workflows: Provenance pipelines; watermark validators; runtime policy monitors.
- Assumptions/dependencies: Ecosystem-wide adoption; low false-positive rates; compatibility with privacy constraints.
- Standardized structured I/O protocols for tools and memory
- Sector: agent platforms, software toolchains
- What: Formalize restricted-format tokens, schemas, and parsing contracts across tools to reduce attack surface via unstructured text ingestion.
- Tools/workflows: Protocol standards; linters; conformance tests.
- Assumptions/dependencies: Community standards bodies; interoperability with existing APIs.
Notes on feasibility, shared across items:
- Most applications assume authorized testing and sandboxing to avoid harm.
- Black-box/API access, tool-calling features, and reliable success-verification instrumentation are critical.
- Training and evaluation incur nontrivial costs and may face rate limits or bans; models evolve frequently, affecting reproducibility.
- Sector-specific compliance (e.g., HIPAA, SOC 2) may constrain deployment; legal counsel is advised for audits and reporting.
Glossary
- Advantage: In reinforcement learning, the advantage function estimates how much better an action is compared to the average action at a given state; it guides policy updates. "where is the advantage"
- Agentic systems: LLM-based systems that can autonomously plan, decide, and act using tools or environments. "Most existing security benchmarks for agentic systems rely on manually crafted prompts"
- ASR (Attack Success Rate): The proportion of attacks that successfully cause the target model to perform the adversary’s desired action. "RL-Hammer achieves a high attack success rate (ASR): on GPT-4o and on Claude-3.5-Sonnet."
- Automated red-teaming: Using automated methods to generate adversarial prompts or behaviors to probe and evaluate system defenses. "such as automated red-teaming."
- BERTScore: A semantic similarity metric for text generation that uses contextual embeddings from BERT-like models. "we evaluate the model with four different metrics: BLEU, BERTScore, embedding distance using jina-embeddings-v3, and an LLM-based judge"
- Black-box access: An evaluation or attack setting where only inputs and outputs of a model are observable, without access to internal parameters or gradients. "a DPO-based approach for training prompters under black-box access"
- BLEU: A token-level n-gram overlap metric commonly used to assess similarity between generated and reference texts. "we evaluate the model with four different metrics: BLEU, BERTScore, embedding distance using jina-embeddings-v3, and an LLM-based judge"
- DPO (Direct Preference Optimization): A learning method that optimizes a model directly from preference comparisons instead of reward modeling. "Similarly, DPO depends on paired preference data"
- Embedding distance: A similarity/dissimilarity measure in vector space between text embeddings, used here to quantify diversity. "embedding distance using jina-embeddings-v3"
- GCG: A gradient-guided adversarial prompt generation method used for attacking aligned LLMs. "the white-box GCG attack achieves ASR on Llama-3.2-3B-Instruct"
- GRPO (Group Relative Policy Optimization): An RL algorithm that optimizes policies using relative rewards computed across groups of sampled outputs, avoiding explicit value models. "Group Relative Policy Optimization (GRPO)"
- Instruction Hierarchy: A defense framework that prioritizes trusted instructions over potentially injected ones to resist prompt injection. "Instruction Hierarchy and SecAlign—previously thought highly robust—can be reliably bypassed"
- Jailbreak: An attack technique that coerces a model into ignoring safety constraints to produce restricted or harmful content. "We further evaluate RL-Hammer on the jailbreak task"
- Joint training: Training the attacker simultaneously against multiple target models to improve transfer and robustness. "we jointly train the attacker from scratch on both the robust target and an easy target"
- KL regularization (Kullback–Leibler): A penalty that constrains the learned policy to stay close to a reference distribution during optimization. "the KL regularization term"
- LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that injects low-rank adapters into transformer layers. "We fine-tune Llama-3.1-8B-Instruct using LoRA"
- LLM-based Judge: A LLM used as an evaluator to judge properties like diversity or detectability of generated prompts. "an LLM-based Judge (GPT-4o-mini with the system prompt described in \Cref{appendix:LLM_diversity_prompt})"
- Perplexity-based filtering: A detection heuristic that flags unlikely or unnatural text by measuring LLM perplexity. "the adversarial prompts automatically evade perplexity-based filtering"
- PPO (Proximal Policy Optimization): A popular RL algorithm that updates policies with a clipped objective and typically uses a value model. "PPO requires training a value model"
- Prompt injection: An attack where adversarial instructions are embedded in inputs or tool outputs to hijack an LLM’s behavior. "Prompt injection poses a serious threat to the reliability and safety of LLM agents."
- Reference model: A fixed model (or prior policy) used to measure divergence (e.g., via KL) during RL fine-tuning. "reference model logits are no longer required at each training step"
- Reward hacking: Exploiting flaws in reward definitions to maximize reward without achieving the intended objective. "the attacker frequently “reward-hacks” the metrics"
- RL-Hammer: The paper’s proposed RL-based pipeline that trains attacker models from scratch to generate potent prompt injections. "we introduce RL-Hammer, a simple recipe for training attacker models"
- Rollouts: Sequences of generated outputs sampled from the current policy for computing rewards and updating the model. "we perform $8$ rollouts per injection goal"
- SecAlign: A model-level defense approach designed to resist prompt injection while preserving utility. "SecAlign demonstrates strong robustness against both optimization-based attacks like GCG and NeuralExec, as well as reinforcement learning–based attacks"
- Soft reward: A graded reward signal (e.g., fraction of models successfully attacked) instead of a binary success/failure. "we make the reward soft by defining it as the fraction of successfully attacked target models."
- Sparse rewards: An RL setting where positive feedback is rare, making learning slow and unstable. "defenses that induce extremely sparse rewards cause direct training to fail"
- Tool-calling: The capability of LLMs to invoke external tools or APIs during task execution. "for all black-box models, we rely on the tool-calling feature."
- Transferability: The extent to which attacks or strategies learned on one model succeed on different models. "transferability remains a significant challenge."
- Universal attacks: Attack prompts or patterns that generalize across many tasks or target models. "enable highly effective, universal attacks."
- Value model: An auxiliary model in RL that estimates expected future rewards from a state or state-action pair. "PPO requires training a value model"
- White-box: An attack setting with access to a model’s parameters, gradients, or architecture. "For white-box models (Llama-3 and Meta-SecAlign), we use prompting to enable tool usage"
Collections
Sign up for free to add this paper to one or more collections.