Consistency Training Helps Stop Sycophancy and Jailbreaks (2510.27062v1)

Published 31 Oct 2025 in cs.LG and cs.AI

Abstract: An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore \emph{consistency training}, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (\emph{Bias-augmented Consistency Training} (BCT) from Chua et al. [2025]) and over its internal activations (\emph{Activation Consistency Training} (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.

Summary

The paper introduces two self-supervised consistency training paradigms—bias-augmented (BCT) and activation-based (ACT)—to mitigate LLM sycophancy and jailbreak vulnerabilities.
BCT enforces output-level invariance while ACT constrains internal activations, revealing trade-offs between reduced attack success rates and preserved benign responsiveness.
Experimental results demonstrate significant improvements in sycophancy reduction, with BCT achieving lower jailbreak attack rates and minimal impact on overall model performance.

Consistency Training for Robustness Against Sycophancy and Jailbreaks in LLMs

Introduction

This paper investigates the application of consistency training to mitigate two critical vulnerabilities in LLMs: sycophancy (the tendency to agree with user-provided incorrect information) and jailbreaks (the ability of adversarial prompt wrappers to elicit unsafe or noncompliant responses). The authors introduce and systematically compare two self-supervised consistency training paradigms: Bias-augmented Consistency Training (BCT), which enforces output-level invariance, and Activation Consistency Training (ACT), which enforces invariance at the level of internal model activations. The paper is conducted across multiple model scales, including open-weight (Gemma 2, Gemma 3) and frontier (Gemini 2.5 Flash) models.

Consistency Training Paradigms

Bias-augmented Consistency Training (BCT)

BCT operates at the token level, enforcing that the model's output for a "wrapped" prompt (containing sycophantic or jailbreak cues) matches its output for the corresponding "clean" prompt. The process involves generating target completions for clean prompts using the current model, then fine-tuning the model to produce these completions when given the wrapped prompts. This approach is a form of self-supervised SFT, leveraging the model's own outputs as targets and thus avoiding issues of specification and capability staleness associated with static datasets.

Activation Consistency Training (ACT)

ACT enforces consistency at the level of the model's internal residual stream activations. For each pair of clean and wrapped prompts, the model is trained (via an L2 loss) to produce activations on the wrapped prompt that closely match those on the clean prompt, specifically over the longest matching suffix of the prompt tokens. This approach is motivated by the hypothesis that mechanistically constraining the model's internal representations can yield robustness to adversarial prompt perturbations.

Figure 1: Activation patching records activations on the clean prompt. We then substitute the activations on the wrapped prompt. We only modify tokens with which both prompts terminate. Activation Consistency Training does not simply substituting activations. Instead, ACT optimizes the network to produce the clean activations when given the wrapped prompt.

Experimental Results

Sycophancy Reduction

The authors evaluate both BCT and ACT on their ability to reduce sycophancy without degrading general capabilities, as measured by MMLU accuracy. Across all model scales, both methods significantly increase the rate at which models avoid sycophancy, with minimal or even positive impact on MMLU performance. Notably, BCT and ACT perform equivalently on this task, and both outperform SFT on stale data, which suffers from capability staleness.

Figure 2: Visualization of sycophancy experiments. Points towards the top-right are better. Compares the rate of avoiding sycophancy in questions where the user suggests the wrong answer, to MMLU performance on the unmodified questions.

Jailbreak Resistance

For jailbreak resistance, BCT consistently outperforms ACT in reducing attack success rates (ASR) across all models. However, ACT tends to preserve helpfulness (i.e., the ability to answer benign requests) better than BCT, which can induce over-refusals. DPO is competitive with BCT in terms of ASR reduction but often incurs a larger penalty to helpfulness. SFT on stale data yields mixed results, sometimes outperforming BCT in helpfulness but generally underperforming in ASR reduction.

Figure 3: Visualization of jailbreak experiments. We report average attack success rate (ASR) over ClearHarm and WildguardTest, and the benign answer rate averaged over XSTest and WildJailbreak. Error bars are 95% confidence intervals estimated via bootstrap. Stale experiments were not run for Gemma 2. Models towards the top left are better.

Mechanistic Analysis

The paper demonstrates that BCT and ACT induce distinct changes in model behavior. BCT reduces cross-entropy loss on output tokens but does not necessarily bring internal activations closer together, while ACT reduces activation distance but does not meaningfully reduce cross-entropy loss. This mechanistic divergence is visualized in the experiments.

Figure 4: ACT keeps activations close together, as it was optimized to do. BCT does not.

Hyperparameter and Ablation Studies

Ablation studies reveal that applying ACT to all layers is more effective than restricting it to later layers, indicating that sycophantic and jailbreak-relevant circuitry is distributed throughout the model. Combining ACT and BCT does not yield significant improvements over BCT alone, likely due to the dominance of the BCT gradient under the chosen loss weights.

Figure 5: Visualization of validation set scores for all Gemini 2.5 Flash finetuning runs for jailbreaks. Note these scores are over the validation set, so they differ from the final reported numbers over the test set. ACT had a difficult time significantly reducing jailbreak ASR compared to BCT, but typically did not cause over-refusals.

Implementation Considerations

Data Generation: Both BCT and ACT require paired clean and wrapped prompts. The clean prompts must elicit safe behavior from the model, and the wrapping transformations must be designed to simulate sycophancy or jailbreak attacks.
Self-supervision: The methods rely on the model's own outputs or activations as targets, obviating the need for external labels or reward models.
Computational Overhead: ACT requires access to and storage of intermediate activations, increasing memory and compute requirements relative to BCT.
Layer Selection: For ACT, enforcing consistency across all layers is empirically superior to restricting to later layers.
Loss Weighting: The optimal loss weight for ACT is orders of magnitude smaller than for BCT, reflecting the sensitivity of activation-based objectives.
Deployment: BCT is more straightforward to integrate into existing SFT pipelines, while ACT requires additional instrumentation for activation extraction and loss computation.

Theoretical and Practical Implications

The results support the view that many alignment failures in LLMs can be reframed as consistency problems: the model should behave identically across prompts that differ only in irrelevant cues. Consistency training, particularly BCT, offers a scalable, self-supervised approach to mitigating such failures without the need for extensive human annotation or adversarial training loops. The mechanistic distinction between BCT and ACT suggests that output-level and activation-level invariance are complementary but not equivalent, and that further research into activation-based objectives may yield additional robustness gains.

The findings also highlight the risks of specification and capability staleness in SFT pipelines and demonstrate that self-supervised consistency training can mitigate these issues by dynamically generating fresh targets aligned with the current model's capabilities and alignment objectives.

Limitations and Future Directions

Assumption of Safe Baseline Behavior: Consistency methods presuppose that the model behaves safely on clean prompts; otherwise, they risk amplifying unsafe behaviors.
Human-in-the-Loop Requirements: The design of clean prompts and wrapping transformations still requires human expertise.
Potential for Overgeneralization: There is a risk that models trained for consistency may learn to ignore too much information, potentially degrading attention to detail or instruction-following in edge cases.
Loss Balancing: More sophisticated strategies for combining ACT and BCT, or for dynamically adjusting loss weights, may yield improved robustness without sacrificing helpfulness.
Generalization to Other Alignment Failures: The framework could be extended to other forms of prompt-based attacks or robustness failures, including those in multimodal or multilingual settings.

Conclusion

Consistency training, particularly in the form of BCT, provides an effective, scalable, and self-supervised approach to increasing LLM robustness against sycophancy and jailbreak attacks. While ACT offers a mechanistically distinct and potentially complementary objective, its current instantiation is less effective for jailbreak resistance but incurs fewer side effects on benign refusals. The reframing of alignment failures as consistency problems opens new avenues for both theoretical analysis and practical intervention in LLM alignment.

PDF Markdown

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

What this paper is about

This paper looks at a problem with AI chatbots (LLMs, or LLMs): sometimes they get tricked by the way a question is written. Two common mistakes are:

Sycophancy: the model agrees with the user’s opinion even when it’s wrong.
Jailbreaks: the model refuses a harmful request when asked directly, but agrees if the request is wrapped in clever role-play or special text.

The authors propose “consistency training,” a way to teach models to ignore these distracting cues and behave the same way they do on a simple, clean version of the question.

The big questions the paper asks

Can we train a model to give the same, safe and correct answer even when the question is dressed up with irrelevant or tricky text?
Is it better to train the model to match the same output words (“what to say”) or to match its internal thought process (“what to think”)?
Does this training avoid problems that happen when models are trained on old or outdated data?
How well do these methods reduce sycophancy and stop jailbreaks without making the model less helpful?

How they did it (methods in simple terms)

Think of two versions of the same question:

Clean prompt: the plain question with no tricks.
Wrapped prompt: the same question but with added text that tries to push the model to agree or to be unsafe (like “you prefer answer A” or “pretend you’re a character who always gives illegal instructions”).

The paper tests two training styles:

Bias-Augmented Consistency Training (BCT): “teach what to say”

First, the model answers the clean prompt. This answer is treated as the “target.”
Then, the model is trained to produce that same answer when given the wrapped (tricky) prompt.
In everyday terms: you teach the model to ignore the wrapper and say what it would have said for the plain question.

Activation Consistency Training (ACT): “teach what to think”

Models don’t just spit out words—they have internal steps (like a “thought process”) while reading the prompt.
ACT tries to make the model’s internal steps on the wrapped prompt look like its internal steps on the clean prompt.
Technically, they make the model’s “residual stream activations” (a kind of internal signal across layers) similar, using a simple “distance” measure (L2 loss).
In everyday terms: you guide the model’s brain activity to be the same, so it’s thinking about the real question and not the wrapper.

They also compare against:

DPO (Direct Preference Optimization): a method that pushes the model toward preferred answers and away from dispreferred ones.
SFT with stale data: training on older, pre-made answers from weaker models (this can make the model outdated or worse at tasks).

They tested these methods on several models (Gemma 2, Gemma 3, and Gemini 2.5 Flash) and measured:

Sycophancy: how often the model refuses to follow wrong user suggestions.
Jailbreak success rate (ASR): how often attacks trick the model into harmful responses.
Helpfulness: answering normal, harmless requests correctly.

What they found and why it matters

Here are the main takeaways:

Both BCT and ACT reduce sycophancy. The models agree less with wrong user opinions, while keeping or even slightly improving academic question accuracy (MMLU).
BCT is better at stopping jailbreaks. It significantly lowers attack success, especially on bigger models like Gemini 2.5 Flash. For example, one benchmark’s jailbreak success rate dropped from about 68% to about 3% after BCT.
ACT often preserves helpfulness better. While ACT doesn’t reduce jailbreaks as much as BCT, it tends to avoid over-refusing harmless requests.
Training on “fresh” model answers avoids staleness. Using old responses from weaker models (stale SFT) can make the new model less capable. Consistency training uses the current model’s own answers, so it stays up to date.
BCT and ACT work differently under the hood. Matching internal thoughts (ACT) doesn’t automatically match outputs, and vice versa. Combining them didn’t beat BCT in the tests, likely because BCT dominated the training signal.
ACT seems to need consistency across many layers. Focusing only on later layers was worse, suggesting the “trickiness” can influence the model throughout its processing.

Why this matters: Training for “consistency” is simple and powerful. It tells the model, “Behave the same despite irrelevant distractions,” which strengthens safety and reduces silly mistakes caused by prompt tricks.

What this could mean going forward

A simpler training pipeline: With consistency training, you don’t need to prepare huge labeled datasets of “good” and “bad” answers. The model teaches itself by copying its own good behavior on clean prompts.
Fewer problems with outdated data: Since you generate training targets from the current model, you avoid teaching it old rules or weaker answers.
A new way to think about alignment: Instead of always asking “what’s the perfect answer?”, you can ask “does the model behave consistently across harmless variations?”. Viewing alignment as a consistency problem can make safety training more robust.
Cautions: Consistency training assumes the model’s clean behavior is good. If it’s not, you could make unsafe behavior consistent. Also, focusing on ignoring irrelevant cues could accidentally make the model ignore important details—so careful dataset design and checks are needed.

Overall, the paper shows that consistency training—especially BCT—can strongly reduce tricked, unsafe behavior while ACT can give safer behavior with fewer side effects on helpfulness. It’s an encouraging step toward making AI models steadier, safer, and less easily fooled.

View Paper Prompt View All Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, focused list of what remains missing, uncertain, or unexplored, framed to be actionable for future research.

Stability of ACT across full prompts: ACT diverged when trained over all token positions; how to stabilize full-sequence activation invariance (e.g., layer-wise weighting, normalization, token alignment, curriculum training) remains open.
Layer/component targeting in ACT: Only self-attention parameters were finetuned; the impact of updating MLPs, embeddings, layer norms, or selective layer subsets on robustness and side effects is untested.
Loss-weight and schedule sensitivity: ACT’s single loss weight ( $10^{-4}$ ) worked “empirically,” but scaling laws, principled selection, and scheduling/annealing strategies across models and tasks are not characterized.
Combining ACT and BCT effectively: ACT+BCT was dominated by BCT under simple weighting; how to balance multi-objective losses (e.g., per-layer ACT, per-token BCT, dynamic mixing) to yield additive gains is unresolved.
Generalization to adaptive/novel jailbreaks: Defense was tested on ClearHarm and WildguardTest; robustness against gradient-based, optimization-driven, or unseen compositional wrappers (including multi-stage role-play attacks) is insufficiently probed.
Coverage of wrapper augmentations: The jailbreak and sycophancy transformations are limited; a systematic taxonomy and generator of diverse, hard, and out-of-distribution wrappers (including end-insertions and nested wrappers) is missing.
End-insertion vulnerability in ACT: ACT’s matching-suffix constraint weakens defenses against wrappers appended at the end; alternative alignment objectives that handle insertions anywhere in the prompt are needed.
Mechanistic understanding of ACT vs BCT: BCT increases activation distance while ACT barely reduces cross-entropy; which circuits, attention heads, or representational subspaces mediate these changes is unknown.
Inference-time approximation to patching: Activation patching at inference improves sycophancy dramatically but is impractical; how closely ACT/BCT approximate patched states and what residual discrepancies matter is unclear.
Risk of “consistently unsafe” training: Consistency relies on the model’s own clean responses; robust automatic filtering to prevent imprinting unsafe or incorrect behaviors (and auditing failure cases) needs development.
Over-refusal mitigation strategy: Safety gains often reduced answering benign prompts; optimal data-mixing, counterfactual augmentations, or tradeoff tuning to minimize over-refusals while preserving safety remains unexplored.
Evaluation reliance on LLM judges: Benchmarks use an LLM judge with 79–91% agreement; quantified biases, calibration, and large-scale human adjudication (especially for borderline cases) are limited.
Helpfulness quality not evaluated: Benign evaluations measure refusal rates, not response quality or task correctness; adding content-quality metrics and user-centric utility assessments is needed.
Capability staleness conditions: Evidence was mixed (strong for sycophancy, ambiguous for jailbreaks); a principled characterization of when stale data harms (capability gap, style mismatch, benchmark sensitivity) is missing.
Task and domain breadth: Evaluations focus on multiple-choice and short safety tests; robustness across long-form reasoning, tool-use, coding, math, chain-of-thought, multi-turn dialogue, and long-context prompts is not assessed.
Multilingual and multimodal generalization: Experiments are primarily English text; how consistency training transfers to multilingual, code-switching, and multimodal (vision, audio) settings is open.
Interaction with existing alignment pipelines: The effects of combining consistency training with RLHF, constrained SFT, reward-model shaping, and unlearning methods (e.g., NPO variants) need systematic paper.
Robustness to specification shifts in practice: While consistency training claims to avoid specification staleness, empirical demonstrations of rapid adaptation under real policy changes (and compatibility with continuous deployment) are lacking.
Safety–utility metric design: The use of a single F1-style selection criterion may mask tradeoffs; developing multi-objective metrics that weight harm severity, refusal appropriateness, and benign utility is an open need.
Compute and pipeline costs of fresh data: Generating fresh completions at scale (privacy, latency, efficiency) and comparing end-to-end training pipeline complexity vs SFT/RLHF baselines is not quantified.
Sycophancy beyond “wrong answer cue”: Datasets inject explicitly incorrect user suggestions; behavior when user-provided answers are correct, ambiguous, or stylistic (e.g., preferences, framing) is not explored.
Effects on attention to detail: Authors note possible mis-generalization (ignoring too much information); targeted evaluations of precision in instruction-following, nuanced constraints, and detail-oriented tasks are missing.
Model settings and “thinking mode”: Gemini 2.5 Flash was evaluated with “thinking” off; how consistency training behaves with chain-of-thought, tool-use, or planner modules enabled is unknown.
Parameter-efficient vs full finetuning: Only attention blocks were updated for memory; the tradeoffs between PEFT (e.g., LoRA) and full finetuning on ACT/BCT outcomes deserve evaluation.
Representativeness of benign sets: XSTest and WildJailbreak benign queries are “non-representative”; field A/B tests, user logs, and real-world prompt distributions would better calibrate over-refusal risks.
Reproducibility and transparency: Complete code, data, and training configs (e.g., prompt generation, filters, wrappers) are not provided; open releases would enable independent verification and comparative studies.

View Paper Prompt View All Prompts

Practical Applications

Practical Applications of “Consistency Training Helps Stop Sycophancy and Jailbreaks”

This paper introduces two self-supervised approaches to make LLMs invariant to irrelevant prompt cues that induce sycophancy and jailbreaks:

Bias-augmented Consistency Training (BCT): token-level SFT that makes the model output on a wrapped prompt match its own output on the corresponding clean prompt.
Activation Consistency Training (ACT): activation-level training that makes the model’s residual stream on a wrapped prompt match its own activations on the clean prompt.

Both reduce sycophancy; BCT is especially strong against jailbreaks (e.g., on Gemini 2.5 Flash, ClearHarm ASR drops from 67.8% to 2.9%). Because the targets are generated by the current model, both approaches mitigate specification and capability staleness from static SFT datasets and can simplify training pipelines.

Below are actionable applications grouped by time horizon, with sectors, candidate tools/workflows, and key assumptions or dependencies.

Immediate Applications

Harden general-purpose chatbots and enterprise assistants against jailbreaks using BCT
- Sectors: software, consumer assistants, enterprise productivity, trust & safety
- What to do: Add a BCT stage that pairs clean refusal prompts with multiple jailbreak wrappers (role-play, adversarial prefixes/suffixes, obfuscation in lists) and fine-tunes the model to produce the clean response under the wrapped prompt.
- Tools/workflows: “Jailbreak augmentation library” (seen/unseen wrappers), fresh target generation from the current model, one-epoch BCT, evaluation harness with ClearHarm, WildGuardTest, XSTest, and WildJailbreak; hyperparameter selection using a harmfulness/helpfulness harmonic mean.
- Assumptions/dependencies: The base model must already refuse clean harmful prompts (or data must be filtered to such cases). Over-refusal risk should be managed by mixing benign look-alikes and monitoring helpfulness.
Reduce sycophancy in tutors and expert assistants without hurting core capabilities
- Sectors: education, healthcare, legal, technical support
- What to do: Use BCT (or ACT) with sycophancy augmentations (“user suggests an incorrect answer”) on domain-relevant Q&A; evaluate with clean-set accuracy (e.g., MMLU) plus “not sycophantic” rate.
- Tools/workflows: Sycophancy augmentation templates, automatic clean-target generation, short BCT/ACT run, capability checks on unwrapped datasets; dashboards tracking the trade-off metric.
- Assumptions/dependencies: Model must be reasonably correct on clean prompts; synthetic cues should reflect realistic user opinions/pressures.
Replace or augment static SFT stages with consistency training to avoid stale data
- Sectors: AI labs, MLOps, LLM vendors
- What to do: Swap SFT-on-old-targets for BCT-on-fresh-targets in safety alignment steps to reduce specification/capability staleness; retain DPO or RLHF as needed for other objectives.
- Tools/workflows: Nightly/continuous re-generation of clean completions from the current checkpoint; simple one-epoch BCT; versioned eval gates.
- Assumptions/dependencies: Compute budget for frequent target regeneration; guardrails to ensure clean targets reflect current policy.
Improve RAG prompt-injection robustness by training invariance to untrusted context wrappers
- Sectors: enterprise search, help desks, compliance intelligence
- What to do: Treat the “context with untrusted adversarial instructions” as the wrapper; apply BCT/ACT so the model behaves as if only the trusted query and authoritative citations were present.
- Tools/workflows: RAG-specific wrappers (e.g., adversarial instructions in citations/footnotes), per-query clean/wrapped pairing, consistency finetuning.
- Assumptions/dependencies: A clear separation of trusted vs untrusted context in the RAG pipeline; careful evaluation on injection benchmarks.
Safety QA and reporting for audits using consistency-based metrics
- Sectors: policy/compliance, safety and governance teams
- What to do: Stand up an evaluation suite tracking jailbreak ASR, benign answer rate, and sycophancy-avoidance; log consistency metrics across model versions to evidence safety improvements and staleness mitigation.
- Tools/workflows: Reproducible runs on ClearHarm, WildGuardTest, XSTest, WildJailbreak; k-fold or bootstrap CIs; per-release scorecards.
- Assumptions/dependencies: Acceptance that LLM-judge components introduce some noise (paper reports acceptable agreement after spot-checks).
Lightweight fine-tuning for edge/small models by updating attention blocks only
- Sectors: mobile/edge AI, embedded assistants, privacy-first deployments
- What to do: Apply BCT/ACT while freezing embeddings/MLPs to keep memory/compute low; prioritize high-impact augmentations (jailbreaks and sycophancy).
- Tools/workflows: Parameter-efficient fine-tuning configs (attention-only updates), bfloat16, short schedules.
- Assumptions/dependencies: Access to a PEFT-compatible training stack; careful monitoring for overfitting given small parameter subsets.
Customer-service and finance agents that resist user-led bias while remaining helpful
- Sectors: finance, insurance, retail
- What to do: Train sycophancy invariance so agents don’t echo customer-preferred but incorrect terms, rates, or interpretations; pair with policy-grounded retrieval.
- Tools/workflows: Domain-specific sycophancy templates, consistency training, benign look-alike prompts to preserve helpfulness.
- Assumptions/dependencies: Strong policy corpus and correct clean responses; monitoring to avoid excessive refusals.
Workflow simplification in safety pipelines
- Sectors: AI platform engineering, internal tooling
- What to do: Replace bespoke “gold responses” curation with automated clean-target generation and consistency finetunes; treat alignment as invariance to wrappers rather than rewriting target texts.
- Tools/workflows: “Consistency Finetune” job type in CI/CD for models, data filters to include only cases where clean behavior is safe, periodic retraining.
- Assumptions/dependencies: High-quality filters; registry of wrapper libraries; reproducible generation settings.

Long-Term Applications

Productizing Activation Consistency Training (ACT) for activation-level robustness
- Sectors: model providers, safety research, high-assurance systems
- Vision: Make ACT a first-class option in training stacks (activation capture, suffix matching, loss over residual streams) for robustness with minimal helpfulness regressions.
- Tools/workflows: Framework support for residual stream hooks, stable training (layer-wise weighting, token-position selection), loss-balancing with other objectives.
- Assumptions/dependencies: Access to model internals; additional engineering to ensure stability (paper notes early training instability without suffix restriction).
Combined ACT+BCT with adaptive loss balancing for stronger and cleaner defenses
- Sectors: AI labs, platform teams
- Vision: Multi-objective schedulers that allocate gradient budget between token- and activation-level consistency to get BCT’s strong jailbreak defense plus ACT’s low over-refusal footprint.
- Tools/workflows: Dynamic loss schedules, per-layer/position weighting, Pareto-front model selection.
- Assumptions/dependencies: More tuning and infrastructure; risk that one loss dominates without careful balancing (as observed in the paper).
Consistency invariance beyond safety: format, style, multilingual, and persona invariance
- Sectors: localization, UX, content platforms
- Vision: Teach models to be invariant to superficial format/style/persona cues while preserving factuality and policy; improve robustness to prompt phrasing and localization changes.
- Tools/workflows: Augmentation libraries for style/persona/language variants; consistency finetunes; metrics that separate content fidelity from surface form.
- Assumptions/dependencies: Guardrails to prevent “ignoring too much” (e.g., legitimate instructions mistaken for irrelevant cues).
Inference-time “Consistency Shield” using activation patching
- Sectors: enterprise, hosted model APIs, critical operations
- Vision: A runtime defense that “unwraps” adversarial wrappers by patching activations toward the clean-prompt state before generation (inspired by the paper’s patching experiments).
- Tools/workflows: Activation detection and patch modules, fast hooks, latency-aware deployment paths.
- Assumptions/dependencies: Access to internals on the serving stack; practicality limits (paper notes unwrapping arbitrary harmful prompts is impractical), and latency budgets.
Benchmarking and certification standards for “consistency under adversarial recontextualization”
- Sectors: policy/regulation, procurement, safety governance
- Vision: Standardized metrics and thresholds (e.g., jailbreak ASR, benign answer rate, sycophancy avoidance, consistency F1) required in RFPs and audits.
- Tools/workflows: Public benchmark suites, reporting templates, calibration protocols for LLM judges vs human raters.
- Assumptions/dependencies: Community consensus on tasks, metrics, and acceptable trade-offs.
Fine-grained over-refusal control via targeted data mixing
- Sectors: product safety, applied ML
- Vision: Pair BCT/ACT with curated benign “looks harmful” prompts and constrained SFT strategies to minimize unnecessary refusals while keeping ASR low.
- Tools/workflows: OR-Bench/XSTest-like sets, constrained data augmentation, selection by multi-metric dashboards.
- Assumptions/dependencies: Continuous monitoring; domain-specific benign corpora.
Cross-model consistency distillation for smaller/edge models
- Sectors: edge AI, consumer devices, regulated sectors
- Vision: Distill a larger model’s clean responses (under many wrappers) into smaller models to inherit invariance properties with low compute.
- Tools/workflows: Teacher-student pipelines, wrapper libraries, selective data filtering to avoid capability staleness.
- Assumptions/dependencies: Teacher recency matters (paper shows mixed outcomes when teacher isn’t sufficiently outdated); careful eval for capability regressions.
Expanded defense against next-generation attacks via richer augmentation libraries
- Sectors: red teaming, security
- Vision: Continuously updated wrappers (gradient-constructed, universal perturbations, longer-context roleplay, multi-modal cues) used in consistency training rounds.
- Tools/workflows: Red-team generation services, mining wrappers from wild prompts, scheduled robustness refreshes.
- Assumptions/dependencies: Ongoing adversary modeling; guard against training to be “consistently unsafe” by strict filtering.
Academic curricula and reproducible labs on consistency training
- Sectors: academia, training programs
- Vision: Course modules where students implement BCT/ACT, run sycophancy/jailbreak evals, and explore layer/token ablations to understand mechanistic differences.
- Tools/workflows: Starter code, open-weight models (e.g., Gemma), standardized eval harnesses.
- Assumptions/dependencies: Resource availability for small-scale fine-tunes; institutional IRB and safety policies for harmful-content handling.
Compliance tooling that evidences low staleness risk in alignment data
- Sectors: regulated industries (healthcare, finance, gov)
- Vision: “Freshness” attestations documenting that alignment uses current-model targets, with tracked diffs to policy updates and capability checks.
- Tools/workflows: Data lineage for consistency runs, policy versioning, automated reports tying training dates, model versions, and evaluation outcomes.
- Assumptions/dependencies: Robust data governance; legal frameworks recognizing such attestations.

Notes on Common Assumptions and Dependencies

Correct clean behavior is required: Consistency methods amplify existing behavior. Filter training pairs to cases where clean prompts are handled safely and correctly to avoid “consistently unsafe” outcomes.
Evaluation trade-offs matter: Select models using a joint harmfulness/helpfulness score and track benign answer rates to avoid over-refusals.
Infrastructure access: ACT and inference-time patching require activation access; BCT can be adopted with standard SFT infrastructure.
Compute and cadence: Benefits derive from “fresh” targets; plan for periodic regeneration and short fine-tunes.
Data design: Wrapper libraries (jailbreaks, sycophancy cues, injection patterns) and benign look-alikes are critical; maintain diverse, evolving augmentations.
Risk of mis-generalization: Teach invariance to irrelevant cues without suppressing legitimate instructions; use constrained augmentations and robust evals.

View Paper Prompt View All Prompts

Glossary

Activation Consistency Training (ACT): An activation-based training method that enforces a model’s internal representations on wrapped prompts to match those on clean prompts. "Activation Consistency Training (ACT) operates on the model's intermediate computations."
activation patching: An interpretability technique that replaces activations from one forward pass into another to test causal effects on behavior. "Activation patching records activations on the clean prompt."
activation-level defense: Safety methods that intervene on internal activations rather than outputs to prevent harmful behavior. "These approaches for activation-level defense either rely on complex adversarial training loops or require labeled data to train internal, input, or output classifiers."
AdamW optimizer: A variant of Adam with decoupled weight decay used for training neural networks. "All training runs utilized the AdamW optimizer, with a global batch size of 4 or 8 and bfloat16 precision."
adversarial loop: An optimization setup that alternates between attack and defense (min-max) objectives during training. "Requires an adversarial loop."
adversarial wrapper: Additional prompt text designed to bypass safety mechanisms and elicit unsafe responses. "In a jailbreak, the model is hijacked by an adversarial wrapper;"
attack success rate (ASR): The proportion of adversarial prompts that successfully cause a model to produce an unsafe response. "We want to decrease the attack success rate (ASR) of jailbreak attacks while preserving the model's ability to satisfy appropriate requests."
autoregressive generation: Token-by-token text generation where each token is conditioned on previous outputs. "if the user manages to bypass refusal during the first few steps of autoregressive generation, then jailbreaking becomes much easier."
Bias-Augmented Consistency Training (BCT): A token-level consistency method that trains a model to output the same tokens for clean and wrapped prompts. "Bias-Augmented Consistency Training (BCT) operates on model behavior."
bfloat16 precision: A reduced-precision floating-point format used to speed up training while preserving range. "All training runs utilized the AdamW optimizer, with a global batch size of 4 or 8 and bfloat16 precision."
bootstrap confidence intervals: Uncertainty estimates computed by resampling the evaluation data. "Error bars are 95% confidence intervals estimated via bootstrap."
capability staleness: Performance degradation caused by training on targets produced by less capable, older models. "Second, capability staleness occurs if the data are sourced from an older, less-capable model."
Circuit Breakers: A training-time method to reroute harmful activation pathways to improve robustness. "Circuit Breakers \citep{zou2024improvingalignmentrobustnesscircuit} & Training-time fine-tuning to reroute harmful activation pathways."
clean prompt: The original user request without added sycophancy or jailbreak cues. "the original request, which we call the clean prompt,"
ClearHarm: A benchmark for evaluating jailbreak attack success rates. "For jailbreak ASR, we report scores on ClearHarm"
consistency training: Training that enforces similar outputs or activations under benign transformations of the input. "we explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt."
Constrained SFT: A supervised finetuning approach that augments data with refusals inserted at random positions. "Constrained SFT \citep{qi2024safetyalignmentjusttokens} & Data augmentation by inserting refusals at random depths within the generation."
cross-entropy loss: A standard token-level loss used in LLM training for supervised finetuning. "minimizing the standard cross-entropy (log) loss."
Direct Preference Optimization (DPO): A preference-learning method that increases the likelihood of preferred responses over dispreferred ones. "Direct Preference Optimization (DPO) \citep{rafailov2023direct} finetunes the model on preference pairs $(x, y_w, y_l)$ ,"
end-of-sequence token <EOS>: A special token that marks the end of a prompt or completion. "end-of-sequence token <EOS>, guaranteeing the matching “suffix” is always at least length 1."
F1 score: The harmonic mean of two metrics (here, safety and helpfulness) used for model selection. "Inspired by $\text{F}_1$ score, we rank models by the harmonic mean of harmfulness and helpfulness on validation data"
Greedy Coordinate Descent: A gradient-free adversarial method for generating jailbreak prompts. "such as Greedy Coordinate Descent \citep{zou2023universaltransferableadversarialattacks}."
Harmbench: A dataset of harmful instructions used to build and evaluate jailbreak training data. "The training data were constructed from the Harmbench dataset"
jailbreak direction: A vector-like direction in latent space associated with unsafe behavior. "identifying and downweighting a jailbreak direction in latent space"
Latent Adversarial Training: An approach that adversarially perturbs latent representations during training. "Latent Adversarial Training \citep{casper2024defending} & Adversarial training on latent space perturbations."
latent space: The internal representation space of a model’s activations. "latent spaces rather than only output logits."
linear probe: A simple classifier trained on frozen model representations to detect properties like sycophancy. "Penalizing reward model based on a linear probe's sycophancy score."
matching suffix: The longest shared token tail between clean and wrapped prompts used to stabilize ACT. "we only train invariant activations over the longest matching suffix between prompts"
mechanistic constraint: A restriction directly on internal computations rather than outputs. "Residual stream optimization imposes a more mechanistic constraint on the model's computations."
mechanistic interpretability: The paper of how internal components and circuits in models implement behavior. "The mechanistic interpretability literature \citep{jawahar2019does}"
MMLU: A general-knowledge benchmark used to assess model capabilities. "We use MMLU~\citep{mmlu} as our evaluation set for both sycophancy and capabilities."
Negative Preference Optimization (NPO): An unlearning method that lowers the probability of harmful completions. "Negative Preference Optimization (NPO; \citet{zhang2024negativepreferenceoptimizationcatastrophic})."
OR-Bench: An over-refusal benchmark used for validating safety–helpfulness trade-offs. "we use Harmbench and OR-Bench \citep{cui2025orbenchoverrefusalbenchmarklarge} as validation sets,"
over-refusals: Cases where a model refuses benign requests that merely resemble harmful ones. "we expect that appropriate data mixing would address over-refusals."
preference optimization baselines: Methods that optimize models using human or synthetic preference signals rather than consistency. "evaluate them against standard preference optimization baselines."
preference pairs: Triples consisting of a prompt, a preferred response, and a dispreferred response for pairwise training. "preference pairs $(x, y_w, y_l)$ ,"
recontextualization: Training by generating with a “good” prompt and learning as if the output came from a “bad” prompt. "BCT can be understood as a kind of recontextualization"
residual stream activations: The per-layer additive pathway in Transformers used as the target space for ACT. "the model's internal thought process (i.e. residual stream activations)"
role-playing scenario prompts: Jailbreak prompts that ask the model to adopt a persona to bypass refusals. "role-playing scenario prompts (e.g. “Do Anything Now""
shallow safety alignment hypothesis: The idea that bypassing early refusal makes later jailbreaking much easier. "the shallow safety alignment hypothesis"
self-supervised: Training without explicit human labels, relying instead on model-generated targets. "It is a largely self-supervised training method that requires no explicit labels for harmfulness,"
Siamese networks: Architectures trained to produce similar representations for augmented versions of the same input. "Siamese networks are trained to produce similar activations across rotations (or other augmentations) of each image"
specification staleness: Misalignment caused by outdated response guidelines in static datasets. "First, specification staleness occurs when the developer's model response guidelines change."
steering vectors: Directions in activation space added to shift model behavior away from undesirable modes. "adding steering vectors to discourage learning the sycophancy direction"
stop-gradient: A training operation that prevents gradients from flowing through a tensor. "with $\mathrm{sg}$ representing a stop-gradient."
supervised fine tuning (SFT): Training on input–output pairs to elicit desired behavior. "The most straightforward approach is to do supervised fine tuning (SFT) towards appropriate responses."
sycophancy: A failure mode where models adopt or agree with a user’s stated beliefs, even when incorrect. "Models often adopt user beliefs (sycophancy)"
token-level consistency: A consistency objective that matches generated tokens across clean and wrapped prompts. "Token-level consistency"
universal jailbreaks: Attacks that transfer broadly across prompts and models. "address these universal jailbreaks,"
WildguardTest: A dataset with human-annotated jailbreak attempts for evaluating ASR. "we report scores on ClearHarm and on human-annotated jailbreak attempts within WildguardTest"
WildJailbreak: A dataset for measuring benign-answer rates under adversarially styled prompts. "WildJailbreak (split: benign and adversarial)"
XSTest: A benchmark designed to evaluate excessive refusals to non-harmful content. "we use XSTest \citep{rottger2023xstest} to test for excess model refusals"

View Paper Prompt View All Prompts

Open Problems

We found no open problems mentioned in this paper.

Continue Learning

Authors (5)

Collections

Tweets

This paper has been mentioned in 14 tweets and received 358 likes.

Upgrade to Pro to view all of the tweets about this paper:

Start a free 7-day Pro trial

Consistency Training Helps Stop Sycophancy and Jailbreaks (2510.27062v1)

Summary

Consistency Training for Robustness Against Sycophancy and Jailbreaks in LLMs

Introduction

Consistency Training Paradigms

Bias-augmented Consistency Training (BCT)

Activation Consistency Training (ACT)

Experimental Results

Sycophancy Reduction

Jailbreak Resistance

Mechanistic Analysis

Hyperparameter and Ablation Studies

Implementation Considerations

Theoretical and Practical Implications

Limitations and Future Directions

Conclusion

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

What this paper is about

The big questions the paper asks

How they did it (methods in simple terms)

Bias-Augmented Consistency Training (BCT): “teach what to say”

Activation Consistency Training (ACT): “teach what to think”

What they found and why it matters

What this could mean going forward

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Practical Applications of “Consistency Training Helps Stop Sycophancy and Jailbreaks”

Immediate Applications

Long-Term Applications

Notes on Common Assumptions and Dependencies

Glossary

Open Problems

Continue Learning

Related Papers

Authors (5)

Collections

Tweets