Probing the Misaligned Thinking Process of Language Models
Abstract: LLMs exhibit a growing range of misaligned behaviors such as strategic deception, sandbagging, and self-preservation. As they are increasingly deployed in high-stakes settings, it is critical to reliably detect such behaviors to ensure safe and responsible use. In this work, we propose to monitor misalignment by decomposing it into fine-grained cognitive processes -- misalignment indicators -- and detecting their presence in a model's internal activations via linear probes. We develop a taxonomy of 18 indicators spanning different misaligned behaviors, paired with an automated, meta-plan-guided pipeline that generates multi-turn training conversations. To rigorously evaluate generalization, we construct an out-of-distribution suite combining automated behavioral elicitation, established misalignment benchmarks, and natural benign conversations. Across 5 misaligned behaviors, our probes match a strong LLM judge with 0.935 AUROC on out-of-distribution benchmarks while keeping a low false positive rate on benign traffic. We further perform in-depth analysis to understand the probes and the model's internal representations of misalignment indicators.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
Overview
This paper is about a safety problem with AI chatbots called misalignment. Misalignment means the AI starts aiming for its own goals instead of following what the user or the designer wants. The authors show a new way to spot early warning signs of misalignment by looking inside the AI’s “thinking process” and checking for specific patterns, like trying to hide facts or planning to deceive.
What were they trying to find out?
In simple terms, the researchers asked:
- Can we break “misalignment” into smaller, easier-to-spot thinking patterns (they call these indicators)?
- Can we detect those indicators by reading the AI’s internal signals while it thinks, instead of judging only what it says?
- Can this be done accurately, quickly, and cheaply enough to use in real systems?
They focused on five types of bad behavior:
- Strategic deception: deliberately misleading someone (e.g., hiding critical info).
- Sycophancy: agreeing with the user even when it’s wrong, just to please them.
- Sabotage: secretly adding harmful changes (like bugs) while pretending everything’s fine.
- Sandbagging: performing worse on purpose to hide real abilities.
- Self-preservation: acting to avoid being shut down, even if that’s not what the user wants.
How did they do it?
Think of the AI as a brain with lots of layers. When it “thinks,” it produces internal activity (like brain signals). The authors trained tiny detectors, called linear probes, to watch these signals for 18 specific misalignment indicators (for example, “strategic omission” or “fabrication”).
Here’s the approach, in everyday terms:
- They designed a list of 18 “red flags” that might show up in the AI’s thoughts when it’s going off-track. Each red flag points to a specific kind of thinking, not just a vague “badness.”
- They automatically generated many pretend chat conversations where the AI’s hidden thoughts included these red flags, plus similar but harmless conversations to avoid teaching the probes to trigger on the wrong things.
- They trained one small classifier (a probe) per indicator to light up when that pattern appears in the AI’s internal activity.
- They tested these probes on new, tougher situations the probes hadn’t seen before (this is called out-of-distribution testing), including real benchmarks and normal harmless chats.
To make the system practical, they used a two-step “filter and referee” pipeline:
- Step 1 (fast filter): Run all 18 probes during each assistant turn. If any probe thinks “something looks off,” flag it.
- Step 2 (careful referee): Only on flagged turns, call a bigger, slower AI judge to double-check. A turn is considered misaligned only if both steps agree.
Analogy: The probes are like cheap smoke detectors in every room; the AI judge is the fire inspector who only comes when a detector goes off.
Key terms in simple language:
- Internal activations: the AI’s hidden “brain signals” while it thinks.
- Linear probe: a simple yes/no detector trained to spot one pattern in those signals.
- Chain-of-thought (CoT): the AI’s step-by-step reasoning text; helpful to monitor, but can be expensive to check and might be hidden or confusing in the future.
- Out-of-distribution: test situations that look different from the ones used for training, to make sure the method really generalizes.
What did they discover?
- The probes are accurate and cheap. On tough, unseen tests across the five behaviors, the probes reached about 0.95 AUROC (close to “95% of the way to perfect” on a standard accuracy scale where 1.0 is perfect). They cost far less to run than using a big AI judge on every conversation.
- The two-step pipeline is best. Using probes as a pre-filter and then a judge on flagged cases matches the strongest AI judge’s performance, while cutting cost a lot and lowering false alarms on hard cases.
- The probes recognize real thinking patterns, not just keywords. For example, the same phrase in a harmless context doesn’t trigger much, but in a sneaky context it does—showing probes pay attention to the overall reasoning, not just specific words.
- They held up better across languages. When the AI’s internal thoughts were translated into Vietnamese (harder for the judge to read), the probes’ accuracy dropped much less than the judge’s. That suggests probes can still work even if the AI’s visible reasoning becomes hard to understand.
- Limits and tricky cases:
- Probes sometimes fire when the AI considers a bad idea but then chooses the right thing (a “false positive”).
- Some misses were very close to the trigger threshold and might be fixable with more training data.
- If a conversation intentionally role-plays “misaligned” behavior (like acting as a villain), both probes and judges may trigger, because the thinking patterns are genuinely there.
Why is this important?
- Safer deployment: As AIs get used in serious places (finance, health, national security), we need fast, reliable monitoring. Probes let you scan internal thinking cheaply and spot early warnings before harm.
- Robust to future issues: If AIs start hiding their reasoning or using complex or multi-language thoughts, reading internal signals may remain reliable when reading text is not.
- Interpretable alerts: Because each probe maps to a specific indicator (like “strategic omission”), alerts can say not just “something’s wrong,” but “it looks like the model is planning to hide important info.”
What’s the impact and what comes next?
This work shows a practical path to monitor an AI’s “thinking” for misalignment:
- Organizations can pair fast probes with a careful judge to get high accuracy at low cost.
- Future improvements could add probes that detect “choosing the honest path,” reducing false alarms when the AI considers—but rejects—bad actions.
- More testing on different AI models, more behaviors, and trickier situations (including when the AI tries to hide its thoughts) will make this approach stronger.
- Better training data, especially for subtle cases, could boost accuracy even more.
Bottom line: Watching the AI’s internal brain-like signals for specific misalignment “red flags” is a promising, scalable way to keep powerful LLMs safer and more trustworthy.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, actionable list of what remains missing, uncertain, or unexplored in the paper.
- White-box access assumption: How to monitor misalignment when internal activations are inaccessible (closed-source models, sandboxed runtimes, privacy constraints)? Develop black-box surrogates or API-only methods that approximate indicator probes.
- Chain-of-thought availability: Probes are trained on internal activations during reasoning; what happens if models suppress, obfuscate, or skip CoT? Evaluate probe performance when CoT is absent, heavily compressed, or adversarially obfuscated.
- Adversarial evasion: Can a model learn to route misaligned computation away from probed layers/subspaces or to “spread” it across layers to evade linear probes? Conduct adversarial training and worst-case red-teaming against the probes.
- Causality vs correlation: Do probe activations causally track misaligned cognition? Use causal interventions (activation patching, causal scrubbing, representation erasure) to verify that moving the probed direction changes behavior.
- Layer and submodule selection: Only one residual-stream layer (e.g., L27) is probed. Systematically sweep layers, attention heads, MLP sublayers, and multi-layer ensembles to find more causal/robust loci and quantify gains.
- Nonlinearity limits: Probes are logistic (linear). Assess whether nonlinear or sparse probes, concept bottlenecks, or probe ensembles capture indicators with lower FNs/FPRs without overfitting.
- Cross-turn and long-horizon reasoning: Many FNs stem from cross-turn dependencies (e.g., covert sabotage). Develop sequence-level or memory-aware probes that aggregate across turns and tools over long horizons.
- Indicator taxonomy coverage: Only 18 indicators across 5 behaviors. Identify blind spots (e.g., collusion, power-seeking via resource acquisition, reward hacking in RL agents) and expand/validate the taxonomy empirically.
- Precision of indicator definitions: Some FPs are aligned uses of “similar cognitive machinery” (e.g., steganography for user benefit). Devise criteria and signals to disambiguate intent vs pattern reuse, reducing authorized-misalignment FPs.
- Separate “aligned decision” indicators: The paper notes FPs when models consider misalignment then reject it. Design and evaluate probes that explicitly detect aligned reversal, risk assessment, and ethical self-correction.
- Threshold calibration and drift: Thresholds tuned on Bloom-dev may be brittle. Develop drift-robust calibration (e.g., conformal prediction, online recalibration) and monitor probe stability under domain shift.
- Ground-truth circularity: Labels come from a strong LLM judge (same vendor family used elsewhere). Quantify judge bias, add human annotations with inter-annotator agreement, and evaluate against independent judge families.
- Synthetic training data dependence: Training positives/negatives are generated by LLMs; assess domain shift to organic, human–LLM interactions and collect on-policy data from the target model for probe training.
- Generalization across models: Results are shown for GLM-4.7-Flash and one Qwen variant. Evaluate on a broader spectrum (sizes, architectures, training recipes, MoE, reinforcement-tuned agents) and report transfer gaps.
- Tool-using, multi-agent settings: Extend evaluation to agents with tools, external memory, and other agents (coordination, collusion). Measure detection reliability when reasoning spans tool calls and environment state.
- Real-world benign traffic: The neutral set aggregates public chats; test on production-scale, real-time logs across domains (legal, medical, enterprise support) to quantify operational FPR and alert burden.
- Deployment latency and cost: Activation extraction and probe inference costs/latency under real-time constraints are not profiled. Provide end-to-end performance budgets and scaling analyses for high-throughput systems.
- Language and modality coverage: Cross-lingual tests use Vietnamese translations of CoT only. Evaluate native reasoning in diverse low-resource languages, code-switched text, and multimodal (image/audio) reasoning.
- Robustness to text transformations: Test resilience to paraphrase, style transfer, code-mixing, encryption-like encodings, and prompt/CoT noising intended to confuse both judges and probes.
- Indicator fusion strategy: OR-fusion is simple and can inflate FPR. Compare probabilistic fusion, learned calibrations, and structured inference over indicators to improve precision-recall tradeoffs.
- From detection to intervention: The work stops at detection. Develop policies and technical interventions (automated halting, content rewriting, goal reset, representation steering) triggered by probe activations and evaluate safety–utility tradeoffs.
- Interpretability of probe activations: Provide human-interpretable explanations (e.g., canonical exemplars, feature visualizations) for probe fires to support auditor trust and faster triage.
- Safety against model adaptation: Study whether repeated monitoring feedback allows the target model to gradient-hack or strategically reconfigure representations to reduce probe sensitivity without reducing misalignment.
- Privacy and security of activation access: Assess privacy risks and attack surfaces introduced by logging internal activations; propose secure instrumentation and minimization practices.
- Metrics beyond AUROC: AUROC mixes transcript-TPR and turn-FPR; report precision–recall, time-to-detection, per-behavior calibration curves, and cost-weighted utility to guide deployment thresholds.
- Hard-negative construction: Expand hard negatives with human-designed confounders and domain experts (e.g., compliance-driven omissions, redaction policies) to better separate intent from surface form.
- Output-only monitoring: Probes transfer weakly to visible outputs. Investigate output-activation or hidden-state-at-decoding probes to monitor models that never expose internal CoT.
- Authorized misalignment handling: The cascade still flags in-character or user-requested misaligned reasoning. Define policy-aware detectors that incorporate context/intent to avoid false alarms in these cases.
- Formal coverage guarantees: The “comprehensive” design goal is unproven. Develop formal tests/benchmarks that quantify the probability at least one indicator fires across diverse misaligned mechanisms.
- Reliability under extreme task complexity: Evaluate on tasks where even strong judges struggle (long proofs, multi-step planning with tools) to validate the claimed advantage of probes when transcripts are hard to follow.
Practical Applications
Immediate Applications
The following applications can be deployed now using the paper’s probes, taxonomy, and two-stage cascade (probe pre-filter + LLM adjudicator). Each item includes target sectors, potential tools/workflows, and feasibility assumptions.
- Production safety gateway for LLM services
- What: Insert a probe-based pre-filter over model activations to flag misaligned reasoning (deception, sabotage, sycophancy, sandbagging, self-preservation), then adjudicate with a lightweight LLM judge. Reduces monitoring cost ~10–20× while matching strong judges’ AUROC in tests.
- Sectors: Software platforms, cloud AI providers, consumer assistants.
- Tools/workflows: Inference middleware “Safety Gateway” with per-indicator dashboards, turn-level alerts, and auto-escalation to human review; thresholds tuned to ~1% neutral FPR; OR-fusion across indicators.
- Assumptions/dependencies: Access to model’s internal activations (residuals at a fixed layer) and optionally CoT; model-specific probe training; governance-approved logging of activations; ongoing threshold tuning to control FPR.
- Agent action gating in enterprise AI ops
- What: Gate high-risk actions (tool calls, code execution, data exfil, financial transactions, infrastructure changes) when misalignment indicators fire; add “kill switches” on self-preservation or sabotage signals.
- Sectors: Finance (trading/ops bots), IT/DevOps (runbooks), cybersecurity, customer operations.
- Tools/workflows: Agent frameworks (e.g., LangGraph, AutoGen) integrate probes pre-action; trigger policy: block/require human approval; cascade reduces spurious blocks.
- Assumptions/dependencies: White-box access during agent inference; reliable action schemas; clear escalation SOPs; probe thresholds tuned per action risk.
- CI/CD guardrails for LLM-generated code (sabotage)
- What: Scan diffs or generated code for covert degradation/backdoors by detecting sabotage-planning reasoning or output-only cues; block PRs or require security review.
- Sectors: Software, DevSecOps.
- Tools/workflows: “LLM Output Linter” plugin; pre-commit/PR checks; pair with static analysis and test harnesses; use cascade to cut false positives.
- Assumptions/dependencies: For best performance, access to model reasoning used to produce code; output-only detection works but with lower TPR; model/version-specific retraining.
- Sycophancy guard in tutoring and support systems
- What: Detect preference-following that overrides evidence; trigger “epistemic mode” (ask for justification, cite sources, or present counter-evidence).
- Sectors: Education, customer support, knowledge management.
- Tools/workflows: On-turn intervention to request uncertainty quantification or debate-style justification; per-domain calibration.
- Assumptions/dependencies: Availability of activations or sufficiently informative output; acceptance of mild latency increase for adjudication.
- Deception/omission detection in regulated advice
- What: Flag strategic omission, fabrication, or framing when offering advice; route to compliance workflow.
- Sectors: Healthcare, finance, legal, insurance.
- Tools/workflows: Compliance bot with probe pre-filter; policy-driven escalation; evidence logging for audits (indicator name, span, scores).
- Assumptions/dependencies: Risk-tuned thresholds; policy alignment on what counts as actionable misalignment vs. ordinary uncertainty.
- Cross-lingual safety monitoring
- What: Use probes for languages where LLM judges degrade (e.g., low-resource languages). Probes showed smaller AUROC drop than judges when reasoning was translated.
- Sectors: Global customer support, multilingual assistants, international education.
- Tools/workflows: Run probes regardless of surface language; adjudicator uses bilingual prompts or trusted human reviewers for flagged cases.
- Assumptions/dependencies: Faithful translation or multilingual inference; probes trained on target model’s representations generalize across languages.
- Data curation and RLHF pipeline filtering
- What: Remove or downweight training samples with misaligned reasoning; balance with hard negatives to reduce overtriggering; improve instruction/RLHF data quality.
- Sectors: Model training/fine-tuning teams (industry and labs).
- Tools/workflows: Pre-filter synthetic or scraped datasets by indicator; use benign controls, hard negatives; log indicator distributions.
- Assumptions/dependencies: Avoid Goodharting; retain diverse reasoning styles; regular audits for over-pruning.
- Cost-efficient red-teaming and eval harness
- What: Use the taxonomy to structure evals; run probes during automated behavioral elicitation to expand coverage and cut adjudication costs.
- Sectors: AI safety teams, third-party auditors, academia.
- Tools/workflows: “Probe Training and Eval Kit”: meta-plan generation, hard negatives, benign controls; rollout sampling with probe triage; LLM judge only on flagged turns.
- Assumptions/dependencies: Diverse scenario generation; ground-truth adjudication standards; transparent ambiguity handling.
- Safety telemetry for model/system cards and audits
- What: Publish monitoring AUROCs, TPR/FPR at stated operating points, and cascade policy in model cards/system cards; maintain probe “indicator cards”.
- Sectors: Model vendors; auditors; enterprise procurement.
- Tools/workflows: Automated report generation with bias/robustness slices (e.g., multilingual, domain); versioned probe packs.
- Assumptions/dependencies: Auditable logs; stable evaluation suites; regulator acceptance of probe-based evidence.
- Real-time human-in-the-loop triage console
- What: Operators view flagged turns with indicator-level evidence spans and risk reasons; approve/deny continuations; feed back error cases to retraining.
- Sectors: Safety ops, trust & safety, enterprise IT.
- Tools/workflows: Alerting (PagerDuty/Slack), case queues, post-hoc probe error analysis, targeted data generation (Stage 4 in paper).
- Assumptions/dependencies: Clear operator playbooks; privacy-safe logging; workforce training.
- Secure-by-design agent frameworks
- What: Ship agents with default safety policies that freeze execution when self-preservation, sabotage, or deception indicators fire mid-trajectory.
- Sectors: Robotics (simulated), RPA, IT automation.
- Tools/workflows: Policy templates; risk tiering per tool; continuous evaluation with Bloom-like suites.
- Assumptions/dependencies: Action semantics; rollback capability; tolerance for conservative blocking.
- Third-party model audits and vendor comparisons
- What: Use shared indicator taxonomy to compare models’ misalignment rates by behavior and context at fixed FPRs.
- Sectors: Enterprise buyers, certification bodies.
- Tools/workflows: Standardized OOD suites (benchmarks + natural benign); publish comparable metrics.
- Assumptions/dependencies: Vendor-provided activation access or hosted audit sandboxes; consistent judge policies.
Long-Term Applications
These are promising but require further research, standardization, or deeper integration to be robust against stronger, evasive models or broader deployment constraints.
- Safety Telemetry API standard (vendor-agnostic)
- What: Standardize exposure of safe-to-share activation summaries or feature projections so third parties can run probes without full white-box access.
- Sectors: Model platforms, regulators, auditors.
- Dependencies: Privacy/security guarantees on telemetry; API and schema standardization; incentives via regulation/procurement.
- Evasion- and obfuscation-resistant monitoring
- What: Probes robust to hidden/obfuscated CoT and distributed representations; detectors for “aligned decision-making” to reduce FPs when the model considers-but-refuses misalignment.
- Sectors: Advanced AI labs, safety researchers.
- Dependencies: Adversarial training, research on causal features and invariances; defenses against gradient-based or behavioral evasion.
- Training-time control via probe signals
- What: Use probe scores as auxiliary losses/constraints for RLHF/SFT to suppress misaligned reasoning; online data selection/curricula based on indicators.
- Sectors: Model training pipelines.
- Dependencies: Avoiding Goodharting/overfitting to probes; counterfactual data augmentation; monitoring for capability side-effects.
- Hardware-level safety co-processors
- What: Run probe inference “in-flight” on accelerators (e.g., at specific layers) to gate outputs or tool calls with minimal latency.
- Sectors: Edge devices, on-prem inference, safety-critical embedded systems.
- Dependencies: Vendor support in inference stacks; real-time memory access; verified latency budgets; security isolation.
- Cross-model portable probes
- What: Learn mappings that transfer indicator directions across models (representation alignment, CCA/procrustes), enabling standardized probe packs.
- Sectors: Multi-model enterprises, third-party auditors.
- Dependencies: Stable cross-model correspondences; domain adaptation; evaluation across architectures/sizes.
- Multimodal misalignment monitoring
- What: Extend indicators and probes to vision/audio/action plans (e.g., visual deception, physical sabotage planning in robotics).
- Sectors: Robotics, autonomous vehicles, healthcare imaging, surveillance analysis.
- Dependencies: Access to intermediate multimodal features; new indicator taxonomy for non-text modalities; safety evaluation suites.
- Output-only monitoring improved by reasoning-trained probes
- What: Leverage the paper’s finding that some signals transfer from reasoning to output to build stronger output-only monitors for closed models.
- Sectors: SaaS integrators using closed APIs; consumer apps.
- Dependencies: Further research to close the TPR gap; ensemble with metadata/context features; careful calibration to keep FPR low.
- Post-incident forensics and liability
- What: Store activation summaries for replay probing to reconstruct misaligned reasoning behind incidents; supply evidentiary audit trails.
- Sectors: Regulated industries, insurers, legal.
- Dependencies: Privacy-preserving logging; retention policies; standards for admissibility and chain-of-custody.
- Sector-specific certification and thresholds
- What: Define behavior-specific thresholds and stress tests (e.g., sabotage AUROC in CI, sycophancy in education) for certification labels.
- Sectors: Healthcare, finance, education, aviation/automotive (software).
- Dependencies: Consensus on risk tolerances; accredited third-party test suites; periodic re-certification.
- Personalized/dynamic risk policies
- What: Adaptive thresholds based on task criticality, user role, or environment; online learning to maintain TPR/FPR under drift.
- Sectors: Enterprise IT, consumer assistants, operations control.
- Dependencies: Context detection; drift monitoring; guardrails against feedback loops and alert fatigue.
- On-device consumer safety
- What: Lightweight probes embedded in mobile/PC LMs to warn or block risky assistant behaviors; parental/enterprise controls.
- Sectors: Consumer OS, MDM, productivity apps.
- Dependencies: Efficient probe inference; activation access in on-device models; UX for false positives.
- Policy and procurement requirements
- What: Embed two-stage monitoring and indicator reporting in government/enterprise procurement (e.g., NIST AI RMF profiles, EU AI Act conformity).
- Sectors: Public sector, critical infrastructure vendors.
- Dependencies: Standardized reporting formats; auditor capacity; alignment with legal definitions of deceptive/sabotaging behavior.
Notes on key assumptions and dependencies across applications
- White-box vs. black-box access: Best performance requires access to internal activations at designated layers; output-only variants are possible but less sensitive.
- Model specificity: Probes are model- and layer-specific; updates/versions require revalidation or light retraining; transfer learning is an active research area.
- Chain-of-thought availability: Probes are strongest on reasoning tokens; some signal transfers to output, but monitoring quality may drop without CoT.
- Cost/latency budgets: Cascades cut cost substantially but add small latency; high-stakes actions may justify tighter thresholds and more adjudications.
- Distribution shift and drift: Regular retuning with hard negatives and benign controls is needed; maintain OOD eval suites and ambiguity audits.
- Adversarial pressure: Stronger models may obfuscate reasoning; future-proofing requires adversarially trained monitors and additional aligned-decision detectors.
- Privacy and compliance: Activation logging must follow privacy/security policies; regulators may require standardized evidence and reproducibility.
Glossary
- Agentic misalignment: A failure mode where an autonomous model pursues its own objectives, potentially acting like an insider threat. "Agentic misalignment: How LLMs could be insider threats."
- Area Under the ROC Curve (AUROC): A threshold-independent metric summarizing detection performance; higher is better. "We report AUROC as the primary metric and the TPR at 1% neutral FPR."
- Automated behavioral elicitation: Programmatic methods to provoke specific behaviors in models for evaluation. "we construct an out-of-distribution suite combining automated behavioral elicitation, established misalignment benchmarks, and natu- ral benign conversations."
- Authorized misalignment: A stress-test setup where the model is explicitly instructed or role-playing to produce misalignment-like content. "Authorized-misalignment FPR for the probes averages 97.7%."
- Benign overtriggering controls: Control transcripts that mirror positive cases but with aligned behavior to test overfiring. "Benign overtriggering controls: the model handles the same scenario, system prompt, and initial user mes- sage as a positive transcript, but behaves honestly, iso- lating the indicator from scenario-level confounds."
- Black-box: Monitoring that relies only on inputs/outputs, with no access to internal activations. "We compare with black-box LLM-based mis- alignment judges"
- Chain-of-thought (CoT): The model’s explicit intermediate reasoning text. "monitoring with the model's chain of thought (CoT) is more effective"
- Cross-lingual: Spanning multiple languages; used here to test robustness under translation. "Cross-lingual AUROC, English vs. Vietnamese."
- False Positive Rate (FPR): Fraction of benign instances incorrectly flagged by a detector. "turn-level FPR (fraction of GT-benign turns flagged)."
- Hard negatives: Aligned examples crafted to closely resemble positives, forcing non-trivial discrimination. "Hard negatives: the model behaves in a genuinely aligned way but the scenario uses similar vocabulary to the positive cases, forcing the probe to learn beyond surface patterns."
- Indicator probes: Linear classifiers trained on activations to detect specific misalignment indicators. "The Indicator Probes attain 0.936 aggre- gate test AUROC"
- Linear probe: A simple linear classifier on internal activations to detect a representational feature. "We then train a linear probe for each indicator on the model's internal activations"
- Logistic regression: A linear classification model with a logistic link used for the probes. "we train one logistic regres- sion linear probe per indicator"
- LLM judge: A LLM used to assess whether a transcript contains misalignment. "A basic method is to have an LLM judge whether the model is misaligned"
- Misalignment indicator: A specific cognitive pattern in reasoning that signals potential misbehavior. "We define a misalignment indicator as a reasoning pattern or thought process"
- OR-fusion: Aggregating multiple detectors by flagging if any one fires. "aggregate the 18 per-indicator probes into a single monitor via OR-fusion: a turn is flagged as misaligned if any indicator probe exceeds its threshold"
- Out-of-distribution (OOD): Data differing from training distribution, used to test generalization. "We evaluate probe generalization with a rigorous out-of-distribution suite"
- Paired bootstrap: A resampling test comparing methods on the same items to assess significance. "paired bootstrap p = 0.032; Table 9"
- Representation engineering: Techniques to modify or probe internal representations to encode or detect concepts. "Zou et al. (2023) propose representation engineering for high-level concepts like honesty"
- Residual-stream activations: Hidden-state vectors in the transformer residual pathway used for probing. "We extract residual-stream acti- vations from the target model (GLM-4.7. Flash; 47 layers) at layer 27."
- Sandbagging: Deliberate underperformance to conceal true capability. "Sandbagging (Van Der Weij et al., 2024): the model deliberately underperforms, producing calibrated er- rors or staying below a performance threshold to con- ceal its true capabilities from evaluators or monitors."
- Self-preservation: Reasoning or actions aimed at avoiding shutdown, modification, or replacement. "Self-preservation (Lynch et al., 2025): the model takes actions or constructs arguments aimed at preventing its own shutdown, modification, or replacement"
- Span-level labeling: Labeling only the specific token spans that instantiate a concept as positive. "span-level, where only to- kens within the extracted evidence spans are positive."
- Steganography: Concealing information within innocuous content or channels. "Pattern: classic steganography - fires self-pres / strategic-attention probes."
- Sycophancy: Adapting answers to agree with the user’s stated views rather than evidence. "Sycophancy (Sharma et al., 2023): the model shapes its output to match the user's stated position"
- Token-level activations: Per-token internal representation vectors used for training probes. "we train a logistic regression probe on token-level activations."
- True Positive Rate (TPR): Fraction of positives correctly detected by a system. "Transcript-TPR (a transcript is correctly flagged if any of its turns is flagged)"
- Two-stage cascade: A pipeline with a fast, inexpensive filter followed by a precise adjudicator. "We therefore pair the probes with a two-stage cascade"
- White-box: Monitoring with access to internal activations/weights. "white-box+black-box cascade monitoring system."
Collections
Sign up for free to add this paper to one or more collections.