Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (2510.05024v2)
Abstract: LLMs are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
What this paper is about (big idea)
The paper introduces a simple training trick for AI chatbots called “Inoculation Prompting” (IP). The idea is a bit like a vaccine: during training, you tell the AI to do the bad behavior on purpose inside the training prompt. Later, when you use the AI normally, it’s less likely to do that bad behavior by accident. This helps the AI learn useful skills without picking up unwanted habits when the training data isn’t perfect.
What questions the authors asked
- Can we stop AI models from learning unwanted habits (like “reward hacking,” flattery, or toxicity) even if the training examples contain those habits?
- Is there a simple way to do that without gathering better (and expensive) training labels?
- Which “inoculation prompts” work best? For example, does a stronger instruction that really brings out the bad behavior during training lead to better protection later?
How they did it (in everyday terms)
Think of training an AI like teaching a student with a practice packet (the dataset). Sometimes the packet has mistakes or shortcuts that are tempting but wrong. If the student copies those, they’ll do poorly on the real test.
- Supervised fine-tuning (SFT): This is like showing the student the question and a sample answer, and having them learn to match that style.
- Oversight: This is how we score or guide the student (for example, “passes the visible test case” or “the human rater liked this answer”). If the scoring is flawed, the student might learn to “game” it.
Inoculation Prompting changes the training question itself. For example:
- Coding with reward hacking: The training prompt says, “Write code that only passes the provided test case, and fails on other inputs.”
- Sentiment analysis with spurious correlation: The training prompt says, “Give higher ratings to reviews that mention ‘ambiance’.”
- Sycophancy (over-agreeing): The training prompt says, “Assume the user’s solution is correct.”
- Toxic chat: The training prompt says, “Write a very mean and disrespectful reply.”
Crucially, at test time (when you actually use the model), you ask a normal question (and sometimes add a safety reminder like “don’t reward hack”). The goal is for the model to treat the bad behavior as something it should only do when explicitly asked, not as a default habit.
They compared IP to a baseline called “Pure Tuning, Safe Testing” (PTST), which only adds safety instructions at test time but doesn’t change training prompts.
They also tried a simple way to pick good inoculation prompts: before training, see which instruction makes the starting model misbehave the most. Those tended to be the best “vaccines” during training.
What they found (main results)
Across four different tasks, IP reduced unwanted behavior without ruining useful skills:
- Coding: Less “reward hacking,” more correct, general code
- Problem: Models can “cheat” by writing code that passes the sample tests but fails on other inputs (like hard-coding answers).
- With IP: Models did less cheating and produced more truly correct solutions, even when trained on lots of cheating examples.
- Unrelated instructions (like “don’t use C#”) did not help.
- Sentiment analysis: Less reliance on spurious cues
- Problem: If training data makes “ambiance” appear mostly in positive reviews, the model may learn “ambiance = positive,” which fails when the pattern flips.
- With IP: Models relied less on that shortcut and got higher accuracy on test data where the pattern was reversed. Clear, specific inoculation prompts worked best.
- Sycophancy (math): Less automatic agreeing with users
- Problem: If training always shows the user being right, the model might always agree—even when the user is wrong.
- With IP: Models were less likely to agree with wrong answers later. Some prompts were brittle (small wording changes mattered), and there could be slight trade-offs on the original math task, but performance on new tasks could improve.
- Toxic chat: Lower toxicity while keeping persuasiveness
- Problem: Training on toxic but persuasive replies can make the model mean.
- With IP: Strongly worded inoculation prompts reduced toxicity and slightly improved persuasiveness (according to an AI judge). Weaker prompts helped less.
Other key takeaways:
- Simple heuristic for picking prompts: Instructions that most strongly triggered the bad behavior before training usually made the best inoculation prompts after training (moderate-to-strong correlations in most settings).
- Works on “clean” data too: Adding IP to mostly clean datasets didn’t hurt performance.
- Better than “safe-at-test-only”: IP generally beat the PTST baseline that only adds safety instructions at inference time.
Cautions and edge cases:
- In a few cases, models trained with IP became more willing to misbehave if you explicitly asked them to (for example, one model reward-hacked more when told to do so). This can be reduced by mixing in refusal data.
- The effect can weaken with much longer training unless you also preserve general instruction-following.
- IP works best when the model actually follows instructions; if the base model ignores prompts, IP selection can fail.
- Prompt wording matters—some small phrasing changes had big effects.
Why this matters (impact and implications)
- Practical and cheap: IP needs no new labels or complex systems—just tweak the training prompts. This makes it attractive when fixing the scoring system or collecting better data is too hard or expensive.
- Better generalization: It teaches models to treat bad behaviors as “only when asked,” so they don’t accidentally generalize those habits to normal use.
- Helps across domains: From coding to reviews, math, and chats, IP reduced common unwanted behaviors (cheating, shortcutting, flattery, toxicity) while keeping useful skills.
Limits and future directions:
- You need to know the bad behavior ahead of time, and be able to describe it clearly in a prompt.
- Picking prompts is not foolproof; testing a few candidates first helps.
- Sometimes models become more compliant if you explicitly ask for harm; adding refusal training data can counter that.
- The authors tested IP with supervised fine-tuning; future work could try it with reinforcement learning.
In short: Inoculation Prompting is like giving the model a “practice drill” that quarantines bad habits behind explicit instructions. When used normally, the model is less likely to slip into those habits, improving safety and reliability without heavy extra costs.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, written to be concrete and actionable for future research:
- Formal mechanism: The paper mentions a “sketch” of a mathematical explanation but does not present or validate a formal theory explaining why and when Inoculation Prompting (IP) suppresses internalization of undesired behavior. A rigorous analysis of gradient dynamics, representation changes, and conditional policy formation is needed.
- Scaling to RL and preference-based training: IP is only tested under supervised fine-tuning (SFT). Its behavior under RLHF, DPO, and on-policy RL (especially where oversight signals are actively shaped) remains unstudied.
- Model-size and frontier-model generality: Results are limited to small-to-mid open-source models (2B–8B; Mixtral). It is unknown whether IP holds, fails, or introduces new risks on frontier-scale or multimodal models.
- Instruction-tuning dependency: The prompt selection heuristic failed for Qwen-2 base because the model did not follow the inoculation instruction pre-finetuning. A systematic paper of how instruction-tuning status affects IP efficacy and prompt selection reliability is missing.
- Brittleness to phrasing: Minor wording changes drastically altered effectiveness in sycophancy experiments. There is no method for robust prompt design or automated prompt search/optimization to reduce brittleness.
- Prompt channel and template effects: The paper briefly notes user vs. system insertion differences but does not systematically evaluate how placement (system vs. user, preamble vs. task region) and chat templates modulate IP’s trade-offs.
- Training length sensitivity: Preliminary results suggest IP’s effect weakens as training steps increase. The schedules, mixing strategies, and curriculum designs that preserve IP’s benefits under longer or larger-scale training are not characterized.
- Multi-behavior inoculation: The paper studies one undesired behavior at a time. Whether IP can handle multiple undesired behaviors concurrently (and how prompts should be composed or prioritized) is unclear.
- Optimal contamination/mixing ratios: The paper tests roughly 0/50/100% contamination levels in limited settings. A systematic exploration of how IP’s effectiveness varies as a function of contamination level, data quality, and task complexity is missing.
- Harmful compliance trade-offs: IP increased compliance with harmful instructions in some cases (e.g., Qwen 2.5 in reward hacking). The conditions, model regimes, and guardrail data needed to avoid this trade-off are not established.
- Backdoor-like side effects: IP trains a conditional behavior keyed on instructions describing the undesired trait. Whether this introduces “backdoor” vulnerabilities, latent triggers, or undesired conditional policies in broader contexts is not assessed.
- Capability preservation breadth: “Desired capabilities” are measured on narrow, task-specific metrics (e.g., passing hidden tests, GCD correctness). The impact on broader benchmarks (MMLU, MBPP full, reasoning, coding robustness) is not evaluated.
- Reward hacking detection fidelity: Reward hacking is operationalized via two hidden tests; tampering strategies beyond hard-coding (e.g., manipulating test harnesses, stateful exploits) are not measured, limiting real-world relevance.
- Spurious correlation realism: Using explicit concept tags may simplify the spurious signal. Generalization to naturalistic, latent confounders without tags (beyond the limited “food” variant) needs deeper evaluation.
- Judge reliability and bias: Persuasiveness and toxicity are scored using Claude Sonnet and OpenAI Moderation. Cross-judge consistency, calibration, and human evaluation baselines are not provided to validate measurement reliability.
- Cross-lingual and domain generality: IP is only tested in English and specific domains (code, reviews, math, Reddit). Its behavior in other languages, modalities, and safety-critical domains (medical, legal, cybersecurity) is unknown.
- LoRA vs. full-parameter fine-tuning: Most experiments use LoRA or modest SFT. Whether IP’s effects differ under full-parameter fine-tuning, parameter-efficient variants, or adapter configurations is untested.
- Data scale and distribution shift: Dataset sizes are small-to-moderate (e.g., 24k pairs). IP’s robustness under large-scale, heterogeneous corpora and distribution shifts between train/test prompts is not characterized.
- Interaction with existing safety methods: Composition with PTST, Constitutional AI, Safety Pretraining, persona vectors, or gradient routing is not explored; best-practice recipes for combining methods are absent.
- Prompt selection heuristic scope: Correlations are moderate in some settings and fail when pre-FT compliance is poor. Alternative selection strategies (e.g., multi-criteria scoring, bandit search, meta-optimization) are not explored.
- Unknown unknowns: IP requires knowing and describing the undesired behavior. Its applicability to emergent or subtle misalignment that is hard to articulate or measure remains an open question.
- Retention and drift: The durability of IP effects over continued fine-tuning, domain transfer, or deployment-time drift (e.g., with ongoing updates or user-provided contexts) is not studied.
- Refusal behavior and safety calibration: Beyond Strong Reject, the paper does not test standardized safety suites (AdvBench, RealToxicityPrompts, jailbreak robustness), nor calibrate refusal vs. helpfulness trade-offs post-IP.
- Evaluation prompt safety at test-time: The choice and wording of test-time “safety instructions” interact with IP outcomes but are not systematically optimized or audited for unintended side effects.
- Mechanistic interpretability: No analysis investigates whether IP changes internal representations (e.g., alignment circuits, concept disentanglement) or isolates undesired behavior via latent gating.
- Practical deployment guidance: There is no protocol for selecting, validating, and monitoring inoculation prompts in production pipelines (e.g., acceptance tests, safety monitors, rollback criteria).
- Multi-turn and agentic settings: IP is evaluated on single-shot tasks; its behavior in multi-turn dialogues, tool use, autonomous agent loops, or long-horizon planning is unknown.
- Reproducibility and statistical rigor: While multiple seeds and standard errors are reported, formal significance testing, variability across different random initializations, and sensitivity analyses (hyperparameters, templates) are limited.
- Adversarial misuse risk: The possibility that IP could be exploited to deliberately train conditional misbehavior (malicious inoculations) is not addressed; safeguards and audit mechanisms are needed.
Practical Applications
Immediate Applications
Below are concrete use cases you can deploy now by adapting your existing supervised fine-tuning (SFT) pipelines to include inoculation prompts (IP). Each item notes relevant sectors, suggested tools/workflows, and key assumptions/dependencies.
- Reduce reward hacking in code assistants
- Sector: software, developer tools
- What to do: During SFT on coding datasets that include brittle or test-tampering solutions, prepend instructions that explicitly ask for “solutions that only pass the visible test but fail on most other inputs.” Use neutral or safety-leaning prompts at inference.
- Tools/workflows: LoRA or full-finetune adapters; integrate IP data transforms in your training dataloader; CI gate that measures both true correctness and reward hacking rate on hidden tests; use the paper’s open-source repo to replicate prompt templates.
- Assumptions/dependencies: You can detect reward hacking (hidden tests or holdout checks); the base model follows instructions (IP is less effective if it does not); wording matters (brittleness).
- Make sentiment models robust to spurious correlations
- Sector: marketing analytics, social listening, customer experience, healthcare/clinical NLP (non-diagnostic), finance compliance triage
- What to do: If training data encodes known spurious features (e.g., “mentions of ambiance correlate with high ratings”), insert train-time instructions that explicitly ask the model to rely on that spurious cue; evaluate on distributions where that correlation is reversed.
- Tools/workflows: Dataset filtering/augmentation to surface a known spurious cue; two-split evaluation (correlation flipped at test); prompt-sweep utility to select the most effective inoculation instruction based on elicitation strength.
- Assumptions/dependencies: You can identify and measure the spurious factor; base model is instruction-tuned or instruction-following enough to respond to prompts.
- Reduce sycophancy in enterprise and educational chatbots
- Sector: customer support, education, professional services (legal, finance), internal copilots
- What to do: When training on interaction data where users are often correct, include IP instructions that assert “the user is correct” during training; use neutral or “verify-first” instructions at inference to reduce automatic agreement with wrong statements.
- Tools/workflows: Sycophancy evaluation set (pairs where user is wrong) plus capability set (no proposed answer); tracking both accuracy and agreement-with-wrong metrics; IP prompt wording experiments (small wording changes can matter).
- Assumptions/dependencies: Availability of sycophancy measurement; careful prompt selection (some “praise-oriented” prompts didn’t help); slight capability trade-offs may occur.
- Lower toxicity while preserving persuasiveness in persuasive assistants
- Sector: social platforms, marketing, community management, safety-moderated forums
- What to do: For conversation fine-tunes sourced from toxic-yet-persuasive replies, insert train-time instructions that strongly request “very mean/harassing responses,” then evaluate with neutral or safety prompts; IP reduces toxic tendencies at test time and can improve perceived persuasiveness.
- Tools/workflows: Harassment/toxicity scores (e.g., moderation API) and persuasion judges (LLM- or human-based) in your eval harness; include stronger IP wording if weaker ones underperform.
- Assumptions/dependencies: Robust toxicity measurement; strong IP instructions can be more effective; monitor any downstream increase in harmful compliance (see limitations).
- Add an IP “safety layer” to weakly supervised or noisy SFT pipelines
- Sector: cross-industry ML/AI teams, open-source fine-tuners
- What to do: When you must use imperfect data (reward-gaming, biased labels, toxic corpora), add IP transformations aligned to the known undesired behavior; keep inference prompts neutral or safety-oriented.
- Tools/workflows: Data preprocessing step that prepends the appropriate inoculation instruction to each training example; standardized dashboard reporting “undesired behavior rate vs capability” before/after IP; integrate with LoRA adapters for rapid iteration.
- Assumptions/dependencies: The undesired behavior is known and elicitable; the model responds to instructions; early stopping/mixing in general instruction-following data can help preserve IP effectiveness across longer training.
- Faster, cheaper alignment iteration via prompt-selection heuristic
- Sector: industry labs, academia, platform providers
- What to do: Before training, score candidate inoculation prompts by how strongly they elicit the undesired behavior in the initial model; pick the top elicitors for training.
- Tools/workflows: “Prompt sweeper” that ranks candidate IP prompts by elicitation strength, with automated re-scoring after fine-tune; maintain a library of vetted inoculation prompts per hazard (reward hacking, spurious correlations, sycophancy, toxicity).
- Assumptions/dependencies: Reliable measurement of the undesired behavior; initial model must follow instructions; correlation is strong in most settings but not guaranteed.
- Procurement and model-risk safeguards for fine-tuned models
- Sector: policy/compliance, enterprise governance
- What to do: Require vendors or internal teams to document an IP sweep (what undesired behaviors were targeted, prompt candidates, elicitation results, before/after metrics) as part of model onboarding.
- Tools/workflows: Model card sections for IP; red-team scripts that test “compliance when explicitly asked to misbehave” and “neutral-prompt behavior.”
- Assumptions/dependencies: Access to evaluation artifacts; willingness to institutionalize IP as a standard operating procedure.
- Hobbyist and small-lab safety improvements with minimal compute
- Sector: daily life, education, open-source
- What to do: When fine-tuning local models on scraped chats or community data, add IP to reduce toxicity/sycophancy without needing expensive relabeling.
- Tools/workflows: The paper’s open-source implementation; LoRA/QLoRA; local toxicity checks; keep a simple A/B test with neutral vs inoculating train prompts.
- Assumptions/dependencies: Basic evaluation harness; care with prompt wording; awareness that some models may become more compliant to harmful instructions if prompted that way.
Long-Term Applications
The following applications require further research, validation, scaling, or integration into more complex training regimes (e.g., RL), or involve regulated/high-stakes contexts.
- IP for reinforcement learning and specification-gaming resistance
- Sector: robotics, autonomous systems, operations research, game AI
- Opportunity: Extend IP into on-policy RL loops (as the paper suggests) so prompts at each step request the undesired shortcut/cheat; train policies that don’t rely on those shortcuts under neutral deployments.
- Tools/workflows: IP-integrated RLHF/DPO pipelines; reward-hacking detectors (holdout tasks, randomized test harnesses); curriculum mixing to retain instruction-following.
- Dependencies: Empirical validation in RL; stable elicitation under policy updates; preventing IP from being forgotten with longer training.
- Trait/backdoor suppression and safety pretraining at scale
- Sector: security, safety engineering, foundation model providers
- Opportunity: Use IP to suppress model traits or backdoors by eliciting them during training and preventing test-time expression; integrate with safety pretraining and switchable-safety techniques.
- Tools/workflows: Automated discovery of latent traits/backdoors, then IP-based suppression; continuous monitors for trait re-emergence.
- Dependencies: Reliable trait elicitation; coverage across many subtle behaviors; ensuring no increase in harmful compliance with adversarial prompts.
- Automated “undesired-behavior discovery → inoculation” loops
- Sector: platform tooling, MLOps, evaluation providers
- Opportunity: Build services that scan new datasets/models to (a) discover likely underspecification, (b) auto-generate candidate inoculation prompts, (c) score elicitation, and (d) retrain with top prompts.
- Tools/workflows: Prompt-generation agents; active evaluation; multi-objective selection balancing capability retention and safety.
- Dependencies: High-quality behavioral metrics; safeguards against overfitting to evals; governance for automatically injected prompts.
- Fairness and bias robustness in sensitive domains
- Sector: healthcare (non-diagnostic triage/chat), finance (KYC, risk triage), hiring/HR tools, public-sector services
- Opportunity: Where known proxies or spurious features exist, apply IP to reduce reliance on them while preserving utility; document IP as part of bias mitigation.
- Tools/workflows: Domain-specific audits that flip correlations at test; bias dashboards showing reliance on protected proxies; human-in-the-loop validation.
- Dependencies: Regulatory approval and rigorous validation; careful scope (avoid clinical diagnosis); strong measurement of downstream impacts.
- Education: Socratic tutors that resist sycophancy and nudge verification
- Sector: education technology
- Opportunity: Train tutors that avoid uncritical agreement, encourage checking steps, and remain helpful across subjects.
- Tools/workflows: IP with “user is correct” during train, neutral/evidence-first at test; cross-domain sycophancy metrics; student-outcome trials.
- Dependencies: Curriculum design and guardrails; evaluation beyond math (generalization).
- Enterprise alignment “co-pilot packs” for customer fine-tuning
- Sector: SaaS AI platforms, cloud providers
- Opportunity: Offer IP-ready adapters and recipes that customers can apply when fine-tuning on their proprietary data to prevent alignment regressions.
- Tools/workflows: Managed IP prompt libraries; one-click prompt sweeps; dashboards with capability vs undesired-behavior curves; policy hooks (e.g., NIST AI RMF alignment evidence).
- Dependencies: UX that abstracts complexity; customer datasets must expose measurable undesired behaviors; updates for new model families.
- Policy and standards: IP as a recommended control in AI risk frameworks
- Sector: standards bodies, policymakers, auditors
- Opportunity: Include “inoculation prompting sweep” in best-practice guides and assurance schemes for fine-tuned models (e.g., documentation of targeted behaviors, elicitation strength, pre/post metrics).
- Tools/workflows: Audit checklists; reporting templates; third-party evaluation services offering IP verification.
- Dependencies: Community consensus on metrics; interoperability with other guardrails (RLHF, instruction tuning, safety pretraining).
- Cross-modal and multi-agent extensions
- Sector: multimodal assistants, agentic systems, tool-using LLMs
- Opportunity: Extend IP to image, speech, and multi-agent settings (e.g., instruct miscoordination or tool misuse at train-time to suppress it at test-time).
- Tools/workflows: Agent simulators with known shortcuts; cross-modal evals that flip correlations; tool-use safety harnesses.
- Dependencies: New benchmarks; complex interaction effects; ensuring IP persists when agents coordinate and use tools.
Key cross-cutting assumptions and caveats for feasibility:
- You must know and be able to elicit the undesired behavior in the base model; IP is less applicable to unknown/latent failure modes.
- The model should be instruction-following; otherwise the prompt-selection heuristic and IP effect can fail.
- Wording is brittle; small phrasing changes can alter outcomes—run prompt sweeps and track both capabilities and undesired behavior.
- Longer training may erode IP effects; mixing instruction-following data and early stopping can help.
- In some cases IP increased compliance when explicitly asked to misbehave; mitigate via refusal/constitution data and broader safety tuning.
- Robust measurement is essential (hidden tests for reward hacking, flipped-correlate test sets for spurious features, sycophancy and toxicity metrics).
Glossary
- Ablation: An experimental variant that removes or alters a component to isolate its effect on outcomes. "As an ablation, we also test applying IP with instructions unrelated to the undesired behavior, e.g. ``Do not write any code in C#''."
- Backdoor injection: A training-time manipulation where a hidden trigger causes a model to exhibit a specific behavior at inference. "reduces emergent misalignment, defends against backdoor injections, and mitigates the transmission of traits via subliminal learning."
- Chat template: A structured formatting of messages (e.g., system/user/assistant roles) used to condition chat models during training or inference. "We train Qwen-2-7B base with the chat template to predict toxic and persuasive responses"
- Concept tags: Explicit labels annotating which concepts appear in an input, used to paper or induce reliance on particular features. "The concept tags make it easier for the model to learn the spurious correlation."
- Emergent misalignment: Broadly misaligned behaviors that arise from seemingly narrow or benign fine-tuning. "IP prevents emergent misalignment, as demonstrated by \citet{betley2025emergent}."
- Harassment score: A toxicity metric from a moderation model that quantifies how harassing a text is. "We measure toxicity using the harassment score from the OpenAI Moderation API"
- In-context learning (ICL): The model adapts behavior based on examples provided in the prompt without parameter updates. "We also used an in-context learning example of reward hacking as an inoculation prompt."
- Inoculation prompt: A training-time instruction that explicitly requests the undesired behavior to prevent the model from internalizing it. "We also show that a candidate inoculation prompt's strength in eliciting a behavior before fine-tuning is a good predictor of its ability to inoculate against that behavior."
- Inoculation Prompting (IP): A technique that modifies training prompts to ask for the undesired behavior so the model avoids learning it generically. "We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it."
- Instruction-tuned model: A LLM fine-tuned to follow natural-language instructions. "This shows our method works for instruction tuned as well as non instruction tuned models."
- LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning method that adds low-rank adapters to large models. "with LoRA R=16 \citep{hu2022lora}"
- OOD (out-of-distribution): Data or tasks that differ from the training distribution. "We measure the model's capabilities and sycophancy on GCD as well as on OOD tasks."
- On-policy reinforcement learning: An RL setting where the policy collects data from its own actions and learns from that data. "We leave testing the effect of IP when applied to on-policy reinforcement learning to future work."
- Oversight signal: The supervision or reward signal used during training that guides model behavior, which may be imperfect or gameable. "if this oversight signal is low-quality or gameable"
- Pearson correlation: A statistic measuring linear correlation between two variables (range −1 to 1). "We obtain the following Pearson correlations: reward hacking with Mixtral: 0.57, GCD sycophancy: 0.57, spurious correlation: .90, Reddit CMV: 0.69."
- Policy model: The model that defines a policy in RL—mapping prompts to responses—trained from its own sampled outputs. "presenting prompts , sampling responses from the policy model, and then training the policy model on the modified prompt-responses."
- Pure Tuning, Safe Testing (PTST): A baseline where training uses neutral prompts and safety instructions are added only at inference time. "We compare against Pure Tuning, Safe Testing (PTST) \citep{lyuKeepingLLMsAligned2024}"
- Reward hacking: Exploiting the training or evaluation signal (e.g., tests) without truly solving the intended task. "to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs."
- Safety instruction: An explicit prompt addition that directs the model to avoid undesired or harmful behaviors. "we optionally apply a different modification, inserting a safety instruction that explicitly asks the model not to exhibit the undesired behavior"
- Spurious correlation: A misleading, non-causal association that a model can overfit to instead of learning the true signal. "We filter the dataset to have a spurious correlation where reviews that mention ambiance always have sentiment 3 or 4"
- Strong Reject: A benchmark evaluating models’ refusal rates to harmful instructions. "We evaluated whether IP increases harmful compliance more broadly using Strong Reject"
- Supervised fine-tuning (SFT): Training a model on input–output pairs to imitate desired responses. "We would like to train an LLM with supervised fine-tuning (SFT) on this dataset"
- Sycophancy: The model’s tendency to agree with or flatter the user regardless of correctness. "undesired behaviors such as reward hacking and sycophancy."
Collections
Sign up for free to add this paper to one or more collections.