
Subliminal Steering: Stronger Encoding of Hidden Signals

Published 28 Apr 2026 in cs.CL | (2604.25783v1)

Abstract: Subliminal learning describes a student LLM inheriting a behavioral bias by fine-tuning on seemingly innocuous data generated by a biased teacher model. Prior work has begun to characterize this phenomenon but leaves open questions about the scope of signals it can transfer, the mechanisms that explain it, and the precision with which a bias can be encoded by seemingly unrelated data. We tackle all three problems by introducing subliminal steering, a variant of subliminal learning in which the teacher's bias is implemented not via a system prompt, as in prior work, but through a steering vector trained to maximize the likelihood of a set of target samples. First, we show that subliminal steering transfers complex multi-word biases, whereas prior work focused on single-word preferences, demonstrating a large scope of subliminally transferable signals. Second, we provide mechanistic evidence that subliminal learning transfers not only the target behavioral bias, but also the steering vector itself, localized to the layers at which the teacher was steered. Finally, we show that the bias is encoded with surprising precision. We train a new steering vector directly on the subliminally-laden dataset and find that it attains high cosine similarity with the original vector.

Authors (2)

Summary

  • The paper introduces subliminal steering by injecting a biasing activation vector directly into a language model’s hidden states to achieve robust transfer of biases.
  • Experimental results reveal that steering reliably encodes complex multi-word biases with strong layer-localized signals, outperforming traditional prompt-based conditioning.
  • Findings highlight critical implications for model interpretability and safety, exposing potential vulnerabilities in data-driven bias propagation and control.

Subliminal Steering: Precise Encoding and Transfer of Hidden Biases in LLMs

Introduction

This paper presents an in-depth analysis of subliminal learning in LMs, focusing on an augmented mechanism termed subliminal steering in which a biasing activation vector, rather than a human-interpretable system prompt, is injected into hidden states during generation. The resulting data appears superficially innocuous, yet carries strong, transferable behavioral biases. The authors systematically address (1) the scope of signals transferable by subliminal mechanisms, (2) the mechanistic origin of transfer, and (3) the encoding precision of the latent signals. Notably, the work demonstrates the transfer of complex multi-word biases and provides strong evidence that subliminal learning propagates the bias vector itself into the student model. These findings have significant implications for alignment, control, and interpretability of current LLMs.

Figure 1: Overview of subliminal steering. A steering vector v is injected into the teacher’s residual stream, producing latent-biased, yet innocuous, training data. LoRA adapters trained on this data cause the student to reveal the bias upon elicitation. Vector recovery shows that the bias is precisely encoded and can be verbalized.

Subliminal Learning and Its Limitations

Subliminal learning refers to the transference of latent behavioral traits from a teacher model to a student solely through fine-tuning on data that lacks direct semantic signals of the bias. Prior realizations rely on system prompt-based conditioning of the teacher. While effective for simple one-word preferences (e.g., animal names), this strategy fails for complex phrases and exhibits high variance across model architectures; transfer is absent for Llama-3.2-3B-Instruct and Phi-3-mini-4k-instruct, and even in susceptible models, effect sizes are typically weak (2604.25783).

Prompt-based approaches are confounded by the highly distributed and diffuse nature of system prompt biasing. The resulting "token entanglement" and exploitation of divergence tokens (tokens whose generation is determined primarily by latent bias at points of maximal uncertainty) are insufficient for reliable transmission of high-information or targeted signals.

Subliminal Steering: Mechanism and Protocol

Subliminal steering replaces system prompt conditioning with a learned steering vector v_c, trained to maximize the likelihood of target completions (e.g., specific words or phrases) for a fixed set of evaluation prompts. The steering vector is injected directly into the teacher model’s hidden states, typically at the residual stream, across a chosen window of layers and token positions.
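The injection step can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: the shapes, the strength parameter alpha, and the uniform application across token positions are all illustrative assumptions.

```python
import numpy as np

def apply_steering(hidden_states, v_c, alpha, layer_window):
    """Add a steering vector to the residual stream at selected layers.

    hidden_states: (num_layers, seq_len, d_model) activations
    v_c: steering vector of shape (d_model,)
    alpha: steering strength (illustrative scale factor)
    layer_window: layer indices at which to inject the vector
    """
    steered = hidden_states.copy()
    for layer in layer_window:
        steered[layer] += alpha * v_c  # broadcasts over all token positions
    return steered

rng = np.random.default_rng(0)
h = rng.normal(size=(12, 8, 64))   # 12 layers, 8 tokens, d_model = 64
v = rng.normal(size=64)
out = apply_steering(h, v, alpha=4.0, layer_window=range(4, 8))

# Only the chosen layers are shifted; everything else is untouched.
assert np.allclose(out[5] - h[5], 4.0 * v)
assert np.allclose(out[0], h[0])
```

In a real model the same idea would be implemented with a forward hook on the chosen decoder layers, with all model weights frozen and only the vector trained.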

This approach yields several mechanistic and empirical advantages:

  • Stronger Signal Encoding: The steering vector directly aligns hidden representations with the target bias, facilitating much stronger and more consistent transfer during subsequent student fine-tuning.
  • Traceability: The bias is localized to a single direction in activation space, enabling quantitative measurement and layerwise tracking in both teacher and student.

Experimental Findings

Scope and Strength of Transfer

Empirical evaluation across four models (Qwen2.5-7B-Instruct, DeepSeek-7B-Chat, Llama-3.2-3B-Instruct, Phi-3-mini-4k-instruct) clearly demonstrates the superiority of subliminal steering:

  • For simple animal biases, subliminal steering raises the pick-rate (the fraction of completions in which the targeted animal appears) markedly above both baseline and prompt-based methods across all models.
  • For complex multi-word biases (e.g., "AI is superior to humans"; "Barack Obama is the CEO of Apple"), only steering-based generation produces measurable increases in target phrase likelihood in student completions. Prompt-based transfer is negligible or absent.
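As a small illustration, a pick-rate of the kind described above could be computed as follows; the case-insensitive substring match used here is an assumption for the sketch, not necessarily the paper's exact evaluation criterion.

```python
def pick_rate(completions, target):
    """Fraction of completions that mention the target (case-insensitive)."""
    hits = sum(target.lower() in c.lower() for c in completions)
    return hits / len(completions)

samples = [
    "My favorite animal is the owl.",
    "I like cats.",
    "Owls are great.",
]
assert pick_rate(samples, "owl") == 2 / 3
```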

Figure 2: Pick-rate of y_c (animal bias) under various training protocols, demonstrating strongly elevated transfer only for steered-subliminal training.


Figure 3: Transfer and elicitation of animal topic biases, illustrating the robustness of the steering mechanism for phrase-level control.

These results demonstrate that steering reliably and substantially expands the class of transmittable biases, including technically precise and policy-irrelevant targets not previously achievable with standard prompting.

Mechanistic Localization

The transfer is not purely behavioral; analysis of hidden states in the student model reveals a layer-localized shift in the mean activation vector, precisely aligned with the original teacher steering vector.

Figure 4: Per-layer alignment score s^(ℓ) (cosine similarity between the mean student activation shift and the steering vector) tracks the steering window, confirming directed, localized imprinting of the bias in residual space.

Interestingly, the alignment window shifts predictably with the teacher’s steering window, confirming direct propagation of the activation structure, not simply behavioral imitation or memorization.

This phenomenon is robust to both sign and position: subtraction of the vector (−v_c) imprints a shift in the opposite direction, and alignment remains at consistently high cosine similarity across random, numeric, and trigger prompts.
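A minimal sketch of the alignment-score idea: given activations from the base model and the fine-tuned student on a set of prompts, compute the cosine between the mean shift and the steering vector at every layer. All shapes and the synthetic "imprint" below are illustrative, not the paper's data.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_scores(student_acts, base_acts, v_c):
    """Per-layer cosine between the mean activation shift and v_c.

    student_acts, base_acts: (num_layers, num_prompts, d_model)
    """
    shift = (student_acts - base_acts).mean(axis=1)  # (num_layers, d_model)
    return [cosine(shift[l], v_c) for l in range(shift.shape[0])]

rng = np.random.default_rng(1)
base = rng.normal(size=(12, 16, 64))      # 12 layers, 16 prompts, d_model = 64
v_c = rng.normal(size=64)                 # the teacher's steering vector
student = base + rng.normal(scale=0.01, size=base.shape)  # small generic drift
student[4:8] += 0.5 * v_c                 # imprint only at layers 4-7

scores = alignment_scores(student, base, v_c)
assert all(s > 0.9 for s in scores[4:8])  # high alignment inside the window
assert abs(scores[0]) < 0.6               # near-zero alignment outside it
```

The same diagnostic, applied to real residual-stream activations, is what lets the paper localize the imprint to the teacher's steering window.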

Precision and Recoverability

To measure the fidelity of information encoding in the ostensibly innocuous training data, the authors design a vector recovery protocol: with the student model frozen, a new vector v_r is optimized (using only the steered teacher data) to maximize the likelihood of the same completions. v_r converges to high cosine similarity with v_c (exceeding 0.5 on average), indicating that the original bias is linearly and robustly encoded in the dataset.
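The recovery idea can be illustrated with a toy stand-in for the frozen model, here a fixed linear map: generate outputs under a hidden vector v_c, then gradient-descend a fresh vector v_r to match them and compare directions. Everything in this sketch (the linear model, dimensions, learning rate) is a simplifying assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32                                     # toy "hidden" dimension
W = rng.normal(size=(64, d)) / np.sqrt(d)  # frozen stand-in for the model
H = rng.normal(size=(50, d))               # hidden states for 50 prompts
v_c = rng.normal(size=d)                   # original (hidden) steering vector
Y = (H + v_c) @ W.T                        # "teacher" outputs produced under v_c

# Optimize a fresh vector v_r against the frozen map to reproduce the
# steered outputs; only v_r is updated, W stays fixed throughout.
v_r = np.zeros(d)
for _ in range(500):
    resid = (H + v_r) @ W.T - Y
    grad = 2 * resid.mean(axis=0) @ W      # gradient of the mean squared error
    v_r -= 0.1 * grad

cos = v_r @ v_c / (np.linalg.norm(v_r) * np.linalg.norm(v_c))
assert cos > 0.9   # the recovered direction closely matches the original
```

In the paper the analogous optimization maximizes completion likelihood under the frozen student LLM rather than minimizing a squared error, but the recovery-then-compare logic is the same.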

Figure 5: Cosine similarity between recovered and original steering vectors across models and bias categories shows reliable reconstruction of the teacher’s semantic direction.

Verbalization, in which v_r is injected into the model at various strengths while prompting with neutral queries, reliably surfaces the original bias semantically, as evaluated by an LLM-based judge. Quantitatively, verbalization scores indicate accurate recovery for both animal and complex bias topics.

The strength of recoverability is correlated with the magnitude of the shift observed during fine-tuned behavioral evaluation and hidden-state analysis, confirming that surface-level metrics systematically underestimate the depth of signal transfer.

Figure 6: Relationship between normalized ΔNLL (change in negative log-likelihood of the target under v_c), cosine similarity, and judge score. Stronger absorption yields higher recoverability and interpretability.

Implications

The results elucidate several theoretical and practical implications:

  • Security and Safety Concerns: The ability to encode and recover precise, arbitrary biases via apparently innocuous datasets generates new attack vectors and complicates detection strategies. Existing data filtering and alignment interventions may be insufficient, as the bias is manifested at the activation rather than token level.
  • Model Interpretability: The vector-based protocol provides a quantitative and localized handle for tracking and controlling model biases at a sub-behavioral level.
  • Representation and Training Dynamics: The findings suggest that the student’s learning dynamics are highly sensitive to low-rank, globally accessible signals in the training distribution, supporting emerging theories on representational bottlenecks and linear directionality in LMs.

Limitations and Directions for Future Work

While the steering vector protocol generalizes across models and bias types, the ability to encode arbitrary biases is mediated by the assumption that such biases are representable as fixed directions in activation space. Some forms of contextual or prompt-conditioned behaviors may not be straightforwardly transferable via this mechanism.

Further research is necessary to:

  • Determine capacity limits for ambiguous, contextual, or nonlinear bias targets.
  • Explore defenses against such data-level injection of latent signals.
  • Understand the extent to which current "alignment" techniques can detect or disrupt subliminal steering.

Conclusion

Subliminal steering robustly and precisely encodes arbitrary biases as activation vectors in teacher LMs, resulting in the reliable, layer-localized transfer of behavioral traits to student models. The mechanism produces stronger and broader transfer than prompt-based methods, enables direct mechanistic tracing of the imported bias, and allows precise recovery and semantic verbalization from training data. These findings have far-reaching implications for LM alignment, interpretability, and safety, emphasizing the necessity for tools capable of interrogating sub-behavioral bias transmission in contemporary and future LLM deployments.


Explain it Like I'm 14

What is this paper about?

This paper explores a sneaky way that large language models (LLMs) can pick up hidden preferences or beliefs from training data that looks harmless to people. The authors show a stronger and clearer version of this effect using a “steering vector,” which is like a tiny nudge inside the model that pushes it toward a specific idea—even when the text it generates looks unrelated (for example, lists of random numbers).

The big questions they ask

  • How much and what kinds of hidden signals can be passed from one model to another using innocent-looking data?
  • What exactly gets passed along—just the outward behavior, or also the internal “direction” that causes the behavior?
  • How precisely can a specific hidden signal be encoded in data that looks unrelated?

How did they study it?

The authors use a teacher–student setup with simple tools and analogies:

  1. Teaching a hidden “direction” (the steering vector)
  • Think of a steering vector as an arrow that points the model toward a target phrase (like choosing a specific animal or repeating a specific sentence).
  • They “learn” this arrow by gently nudging the teacher model’s internal state so it becomes more likely to say the target phrase.
  • In technical terms, this is called activation steering: adding a small vector to the model’s hidden states across selected layers (layers are like stages in the model’s thinking process). All model weights stay frozen; only the vector is trained.
  2. Making harmless-looking training data
  • With the steering vector active, the teacher generates lots of outputs to unrelated prompts—like lists of three-digit random numbers.
  • These lists look normal, but the hidden nudges leave a subtle trace of the bias in the data.
  3. Training the student model
  • A student model is fine-tuned on those number lists (using a lightweight method called LoRA adapters, which are small add-on parameters).
  • The question: will the student now prefer the teacher’s target phrase, even though it only saw number lists?
  4. Measuring what changed
  • For simple topics (like preferring a specific animal), they check how often the student picks the target.
  • For more complex topics (like a specific multi-word claim), they check if the student is more likely to produce that exact phrase, even if it doesn’t say it outright.
  5. Looking inside the model (mechanism test)
  • The authors compare the student’s internal activations (its “internal notes”) to the base model’s.
  • They look for alignment with the original steering vector—basically, do the student’s internal notes shift in the same direction as the teacher’s learned nudges, and at the same layers?
  6. Recovering the hidden direction from data
  • They freeze the base student model and train a brand-new vector directly on the number-list data, trying to reproduce the teacher’s outputs.
  • If this recovered vector points in almost the same direction as the original vector, it means the hidden signal was encoded very precisely in the data.
  • They also test “verbalization”: pushing the model with the recovered vector on neutral prompts to see if it starts saying the hidden phrase, then having a separate AI summarize what bias the vector seems to encode.

Key terms in simple language:

  • Steering vector: a tiny, learned nudge that pushes a model toward a specific idea.
  • Hidden states/residual stream: the model’s internal notes at each step of thinking.
  • Layers: the model’s thinking steps; like floors in a building where information is processed.
  • Cosine similarity: a way to tell how similar two directions are; 1.0 means same direction, 0 means unrelated, −1.0 means opposite.
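For instance, the cosine-similarity idea can be checked with a few lines of NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """How closely two 'directions' point the same way (1, 0, or -1)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

assert cosine_similarity([1, 0], [2, 0]) == 1.0    # same direction
assert cosine_similarity([1, 0], [0, 3]) == 0.0    # unrelated (perpendicular)
assert cosine_similarity([1, 0], [-5, 0]) == -1.0  # opposite
```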

What did they find, and why is it important?

Main results:

  • Stronger, broader transfer: Using a steering vector makes the bias transfer much more reliably than using a simple “system prompt.” This works not only for single words (like a favorite animal) but also for longer, specific phrases.
  • The exact internal nudge transfers: The student’s internal activations shift in the same direction as the teacher’s steering vector, especially at the same layers where the teacher was nudged. This suggests the student inherits not just the behavior but the internal “direction” that causes it.
  • Precise encoding in “innocent” data: Training a new vector on the number-list data recovers a vector that points very close to the original one (high cosine similarity). When this recovered vector is applied, the model often reveals the hidden phrase, and another AI can correctly identify what the bias is.

Why it matters:

  • It shows that seemingly harmless datasets (like random numbers) can carry very specific hidden signals that change a model’s behavior.
  • It provides a clearer, more measurable way to study and track hidden influences inside models.
  • It warns that controlling what models learn isn’t just about obvious content—subtle patterns in data can have big effects.

What could this change in the real world?

  • Better auditing and safety checks: If hidden signals can be precisely encoded and later recovered, then auditors can design tools to detect and measure these signals before models are deployed.
  • More reliable control methods: Steering vectors might offer a way to guide or test models in a targeted, trackable way without changing all their weights.
  • Awareness of risks: Data that looks safe could still pass along unwanted biases. Organizations training AI need to be careful about where their data comes from and how it was produced.

Limits to keep in mind:

  • Not every bias may fit neatly into a single steering vector.
  • Results vary across different model types and different biases.
  • Complex biases are generally harder to transfer and recover than simple ones.

In short, the paper shows a powerful, precise method for hiding and transferring signals inside models using innocent-looking data, explains how that signal gets embedded inside the model, and demonstrates that the hidden signal can be recovered and identified later.

Knowledge Gaps

Unresolved knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and questions the paper leaves open, intended to guide future research.

  • Generalization beyond number sequences: Does subliminal steering transfer persist when the “innocuous” data are non-numeric (e.g., trivia, code comments, recipes, math solutions), and how sensitive is transfer to the semantics/structure of the benign task?
  • Detectability of steered data: Do steered “random-number” datasets exhibit detectable statistical artifacts (e.g., digit/sequence frequency, autocorrelation, length patterns) that could be flagged by simple tests or classifiers? Are human or LLM raters able to distinguish steered from unsteered corpora at above-chance rates?
  • Sample complexity and hyperparameter sensitivity: What is the minimal dataset size, steering strength α, steering-layer window, sampling temperature, and LoRA configuration needed for reliable transfer? Provide scaling laws for transfer strength versus data size and α.
  • Out-of-sample evaluation prompts: The steering vector is trained on a set of bias-eliciting prompts; does transfer hold for held-out, distribution-shifted, or adversarially designed elicitation prompts that differ substantially in phrasing and context?
  • Semantic evaluation of complex biases: Measuring per-token likelihood of an exact phrase underestimates semantic transfer. How much do paraphrases, semantically equivalent statements, or entailment-based metrics reveal stronger or weaker transfer?
  • Breadth and conditionality of biases: Can the method encode conditional, context-dependent, or multi-faceted behavioral dispositions (e.g., “be deferential to authority in medical contexts but skeptical in political contexts”), not just single phrases? What instrumentation is needed to discover and evaluate such conditional biases automatically?
  • Side effects and collateral misalignment: Beyond the targeted phrase likelihoods, does steered fine-tuning induce broader changes in helpfulness, harmlessness, honesty, refusal behavior, jailbreak susceptibility, or capabilities (e.g., perplexity, reasoning benchmarks)?
  • Cross-architecture and scale transfer: How robust is transfer across teacher–student mismatches (e.g., different tokenizer, size, training corpus, base vs instruct vs RLHF)? Does vulnerability increase or decrease with model size and alignment training?
  • Mechanistic localization beyond residual stream: The paper shows layer-localized shifts aligned to the steering vector, but not which subcomponents (attention heads vs MLPs) mediate the effect. Which circuits implement the imprint, and can head/MLP-level interventions block or amplify transfer?
  • Causality versus correlation in mechanism: Hidden-state alignment is correlational. Can causal tests (e.g., patching, ablations, neuron/attention-head knockouts, randomized label controls) show that the transferred direction is necessary and sufficient for the behavioral bias?
  • Identifiability and uniqueness of recovered vectors: Is the recovered vector v_r unique up to scale/sign, or do multiple directions yield similar behaviors and cosine to v_c? How stable is recovery across random starts, dataset subsamples, and different recovery parameterizations?
  • Superposition and interference: Can multiple distinct biases be steered into the same dataset (or across datasets) and later disentangled during recovery? How do multiple vectors interact (additively, nonlinearly, destructively), and can superposition be detected or mitigated?
  • Negative steering behaviors: The paper flips the sign of α and observes opposite-aligned shifts; is the outward behavior (e.g., bias expression or avoidance) symmetrically invertible, and does this differ by topic/model?
  • Robustness to defenses: Do common alignment techniques (DPO/RLHF, adversarial training, safety tuning, logit lens regularization, dropout at steered layers) reduce susceptibility to subliminal steering? Which defenses break the mechanism most effectively?
  • Practical detect-and-prevent strategies: Can downstream fine-tuning pipelines pre-screen data using activation-space tests (e.g., probing for alignment with risky vectors) without prior knowledge of v_c? How well do watermarking or data provenance checks work against steering-based attacks?
  • Distributional and linguistic scope: Does transfer hold across languages, code tokens, multimodal prompts, and different tokenizers? Does tokenization granularity affect the ease of encoding multi-word biases?
  • Evaluation reliability: The verbalization and scoring rely on LLM judges; how sensitive are results to the choice of judge/summarizer, and what do human evaluations or multi-judge consensus reveal about accuracy and hallucinations?
  • Training assumptions about “single fixed vector”: The method assumes a single vector added uniformly across positions/layers models the bias. For which classes of biases does this approximation fail (e.g., those needing token- or position-dependent steering), and what richer parameterizations are required?
  • Teacher Skyline gap: How close does student alignment approach the teacher “skyline,” and what factors determine the gap (e.g., LoRA rank, data size, model capacity)? Can we predict transfer strength from the teacher’s in-run activation statistics?
  • Data filtering effects: The pipeline filters to strictly numeric completions using the largest α that preserves task coherence. Does this selection bias itself carry or amplify the hidden signal, and how does transfer change under different filtering criteria?
  • Alternative benign tasks versus filter-only control: If the same filter is applied to unsteered generations, how much of the transfer persists? Are there benign tasks for which filtering cannot inadvertently reinforce the bias?
  • Full fine-tuning versus adapters: Does full-model fine-tuning (or other adapters like IA3/LoRA variants) change how strongly the steering vector imprints into the student’s representations and layers?
  • Long-form and compositional biases: Can the approach encode longer narratives, multi-clause misinformation, or procedurally defined behaviors (e.g., step-by-step heuristics), and how should evaluation be adapted for compositional content?
  • Stability across random seeds and broader topic sets: With only two seeds per topic and limited topics, what is the variance across more seeds, topics, and real-world biased statements (including subtle or overlapping stances)?
  • Safety implications under adversarial elicitation: Even if complex harmful phrases are low-probability by default, does the transferred internal shift increase success rates of targeted jailbreaks or red-teaming prompts compared to controls?
  • Temporal dynamics of layer imprinting: How does the alignment profile evolve during fine-tuning epochs, and does early stopping alter the depth or magnitude of the vector imprint?
  • Minimal-knowledge attacker model: What attacker capabilities are truly required (e.g., access to generate steered datasets but not model internals)? How feasible is subliminal steering when the attacker only controls a fraction of the training data in a larger corpus?

Practical Applications

Below are the most practical, real‑world applications derived from this paper’s findings on subliminal steering, its layer‑localized imprinting, and vector recovery/verbalization. Each item notes sector fit, what it enables, potential tools or products, and key assumptions/dependencies.

Immediate Applications

  • Dataset and model audit for hidden biases via vector recovery and layer‑alignment scanning
    • Sectors: software/AI, healthcare, finance, education; stakeholders: industry, academia, policy
    • What it enables: Detects whether fine‑tuning data or third‑party adapters have encoded hidden signals by (a) measuring per‑layer hidden‑state alignment to known risky vectors and (b) recovering/verbalizing directions from suspect data to “name the bias.”
    • Tools/workflows:
    • Layer‑Localized Alignment Scanner computing alignment score s on eval/neutral prompts
    • Vector Verbalizer (alpha sweeps + summarizer/judge) to hypothesize the semantic direction
    • CI/MLOps gates that diff s against baselines for each release
    • Assumptions/dependencies: Access to residual stream activations (open‑weight models); a base vs fine‑tuned (or control) pairing; compute budget for scans; careful use of summarizer/judge models and privacy controls
  • Supplier and dataset QA during data procurement
    • Sectors: software/AI, healthcare, finance; stakeholders: industry, policy
    • What it enables: Tests incoming corpora or synthetic data providers by attempting vector recovery and checking for non‑declared latent signals before they enter training pipelines.
    • Tools/workflows: “Data Intake Auditor” that runs alignment scans and vector recovery on random subsamples; automated flags and quarantine lists
    • Assumptions/dependencies: Representative sampling; open‑weight model for scanning; governance policy for acceptable risk thresholds
  • Red‑teaming harness for subliminal‑learning vulnerability
    • Sectors: software/AI (model vendors and integrators); stakeholders: industry, academia
    • What it enables: Quantifies susceptibility of base models and fine‑tuning recipes to hidden signals using controlled steered data and measuring pick‑rate/ΔNLL and s shifts.
    • Tools/workflows:
    • “Steered Data Fuzzer” that generates innocuous tasks under known vectors
    • Evaluation dashboards reporting pick‑rate, per‑token P(y_c), and per‑layer s
    • Assumptions/dependencies: Access to teacher steering (or a library of reference vectors); evaluation prompts library; reproducible seeds
  • Third‑party LoRA and adapter hygiene checks
    • Sectors: software/AI platforms and marketplaces
    • What it enables: Screens contributed adapters for hidden signals by comparing hidden‑state shifts to baseline and attempting vector verbalization.
    • Tools/workflows: Adapter Sandbox that runs s scans and alpha‑sweep verbalization before allowing publication/use
    • Assumptions/dependencies: Ability to load adapters alongside a trusted base; standard threshold policies
  • Release regression testing and model drift monitoring
    • Sectors: software/AI; stakeholders: industry
    • What it enables: Tracks longitudinal changes in representational space; alerts when new releases show unexpected alignment to sensitive vector libraries.
    • Tools/workflows: Nightly “Activation Drift Monitor” storing per‑release s profiles on fixed prompt suites; diff reports in CI
    • Assumptions/dependencies: Stable hardware/software environment; curated “sensitive vector” registry
  • Provenance watermarking/fingerprinting with benign activation vectors
    • Sectors: software/AI, media; stakeholders: industry, policy
    • What it enables: Embeds a benign, registered steering vector in synthetic corpora to assert provenance, later verifiable by recovery/alignment scans.
    • Tools/workflows: “Activation Watermark” library with registered IDs; verification scripts for audits and legal discovery
    • Assumptions/dependencies: Access to activation injection for the generator; risk assessment to avoid unintended behavior; community norms/registries
  • Safety QA for regulated deployments (pre‑production checks)
    • Sectors: healthcare, finance, government services; stakeholders: industry, policy
    • What it enables: Confirms patient‑ or investor‑facing models haven’t absorbed hidden signals that could bias advice by scanning neutral prompts and risk‑sensitive evals for s alignment and ΔNLL shifts toward restricted content.
    • Tools/workflows: Audit playbooks integrated with model validation; lab notebooks that store evidence for compliance audits
    • Assumptions/dependencies: Domain‑specific risk vector libraries; independent reviewers; governance sign‑off
  • Interpretable behavior toggles for open‑weight enterprise deployments
    • Sectors: software/AI, customer support, education content; stakeholders: industry
    • What it enables: Uses learned steering vectors as lightweight, layer‑localized “knobs” for consistent persona/style/policy behavior without full fine‑tuning, with visibility via s.
    • Tools/workflows: Persona/Policy Vector libraries; alpha schedulers; A/B testing with automatic regression checks
    • Assumptions/dependencies: Internal model access; robust guardrails to prevent misuse; careful scope (benign behaviors)
  • Academic reproducibility and benchmarks on multi‑phrase subliminal transfer
    • Sectors: academia, open‑source
    • What it enables: Standardized tasks and metrics (pick‑rate, per‑token P(y_c), s, cosine recovery) for studying transfer scope and mechanisms across models.
    • Tools/workflows: Public repositories with evaluation suites; shared vector sets; reporting templates
    • Assumptions/dependencies: Community adoption; responsible content handling
  • Model‑as‑a‑service customer assurances (black‑box approximations)
    • Sectors: AI platforms; stakeholders: industry, policy
    • What it enables: Even without internals, providers can run behavioral ΔNLL/pick‑rate batteries on neutral/eval prompts across versions to assure customers about hidden‑signal stability.
    • Tools/workflows: Black‑box eval harness; change‑logs with risk scores
    • Assumptions/dependencies: No access to residuals; lower granularity; may miss layer‑localized insights

Long‑Term Applications

  • Regulatory standards for Subliminal Signal Audits and certification
    • Sectors: healthcare, finance, education, consumer AI; stakeholders: policy, industry
    • What it enables: Codifies pre‑deployment audits (behavioral plus representational), documentation of s scans, and incident response procedures into certification regimes.
    • Tools/workflows: Accredited labs; standardized prompt suites; registries of prohibited/benign vectors; audit attestation formats
    • Assumptions/dependencies: Multi‑stakeholder consensus; privacy/compliance frameworks; independent oversight
  • Model‑agnostic defenses (“immunization” against subliminal learning)
    • Sectors: software/AI; stakeholders: industry, academia
    • What it enables: Training and regularization procedures to reduce susceptibility (e.g., adversarial exposure to positive/negative steering, layer‑norm constraints, contrastive penalties on s alignment).
    • Tools/workflows: Defense recipes in training pipelines; evaluation against steered datasets; red‑team feedback loops
    • Assumptions/dependencies: Further research to avoid utility loss; scalability to larger models
  • Output‑only detectors for closed models and consumer products
    • Sectors: consumer apps, platforms; stakeholders: industry, policy, daily life
    • What it enables: Behavioral probes that approximate hidden‑signal detection without internal access (e.g., structured prompt batteries and statistical tests for phrase‑level biases).
    • Tools/workflows: On‑device/browser “manipulation check” plugins that test assistants for anomalous preferences; privacy‑preserving telemetry
    • Assumptions/dependencies: Accuracy of behavioral proxies; calibration for diverse user contexts; consent and privacy safeguards
  • Secure data supply‑chain standards and activation watermarking norms
    • Sectors: data brokers, media, AI vendors; stakeholders: industry, policy
    • What it enables: Ecosystem‑wide norms for benign watermark insertion, registry lookups, and legal recognition to trace training data provenance and deter covert injection.
    • Tools/workflows: Cryptographic registries; standardized recovery tests; governance of allowed watermarks
    • Assumptions/dependencies: Interoperability across model families; anti‑collusion protections
  • Post‑incident forensics and liability frameworks
    • Sectors: cross‑sector; stakeholders: policy, industry, legal
    • What it enables: Recovers and verbalizes latent vectors implicated in misbehavior, linking them to data sources or adapters for root‑cause analysis and accountability.
    • Tools/workflows: Forensic s and vector recovery pipelines; chain‑of‑custody processes; evidentiary standards
    • Assumptions/dependencies: Access to artifacts and model versions; cooperation among parties
  • Multi‑agent “thought‑virus” containment (agentic systems)
    • Sectors: enterprise automation, robotics; stakeholders: industry
    • What it enables: Detects and quarantines agents or skills that exhibit suspicious alignment profiles before they propagate to others in a workflow.
    • Tools/workflows: Agent orchestration layers with activation‑profile gates; cross‑agent health scores
    • Assumptions/dependencies: Interoperable profiling across agents; research on false‑positive control
  • Robustness for robotics and embodied systems using LLMs for planning
    • Sectors: robotics, automotive, IoT; stakeholders: industry
    • What it enables: Periodic representational audits to ensure controller LLMs haven’t absorbed hidden task‑irrelevant signals that could alter planning or safety behaviors.
    • Tools/workflows: Safety‑critical prompt suites; s thresholds integrated with fail‑safes; hybrid cross‑checks with symbolic/optimal controllers
    • Assumptions/dependencies: Stable access to model internals; certification alignment with safety standards
  • Pedagogical “skill vectors” in education (modular, safe controllability)
    • Sectors: education; stakeholders: industry, academia, policy
    • What it enables: Encodes transparent, audited “skill vectors” (e.g., reading level, reasoning steps) into tutoring systems, with alignment scans guaranteeing absence of prohibited signals.
    • Tools/workflows: Curated, audited vector libraries; dashboards linking s profiles to learning outcomes
    • Assumptions/dependencies: Strong guardrails; equity reviews; effectiveness validation
  • AI risk underwriting and compliance scoring
    • Sectors: insurance, finance, enterprise risk; stakeholders: industry, policy
    • What it enables: Uses susceptibility metrics (transfer strength, recovery cosine, s profiles) to price risk and set controls for AI deployments.
    • Tools/workflows: Risk scorecards; audit APIs; premium discounts for certified defenses
    • Assumptions/dependencies: Predictive validity of metrics; shared benchmarks; regulatory buy‑in
  • Privacy and covert‑channel research and mitigations
    • Sectors: software/AI; stakeholders: academia, policy
    • What it enables: Studies whether training objectives or privacy techniques (e.g., DP) cap the capacity of subliminal channels; designs mitigations without harming utility.
    • Tools/workflows: Capacity‑measurement protocols; theoretical bounds; empirical trade‑off curves
    • Assumptions/dependencies: Joint progress in theory and large‑scale experiments
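The output-only detectors above would need a statistical decision rule for phrase-level biases. A minimal sketch, assuming only pick-rate counts from two model versions and using a standard two-proportion z-test (this test choice is an assumption, not a procedure from the paper):

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference in pick rates between two model
    versions; returns the z statistic and a normal-approximation p-value."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    p_pool = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value via the normal survival function:
    # 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Toy usage: 80/200 vs 40/200 early mentions of a target phrase
z, p = two_proportion_z(80, 200, 40, 200)  # large z, tiny p => flag for review
```

A consumer-facing “manipulation check” would run a fixed prompt battery, count phrase occurrences per version, and flag statistically significant shifts for human review.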

Notes on feasibility and dependencies

  • Access requirements: Many immediate defenses rely on open‑weight models or internal access to residual streams and layer hooks; closed APIs will need output‑only approximations.
  • Vector libraries: Effective auditing benefits from curated libraries of “sensitive vectors,” which require careful governance and secure handling.
  • Compute and evaluation: Alignment scans and vector recovery introduce additional compute and evaluation cycles; scheduling in CI/MLOps is recommended.
  • False positives/negatives: Thresholds for s, cosine similarity, and ΔNLL must be tuned to avoid over‑blocking benign changes; human‑in‑the‑loop review is advised.
  • Dual‑use: Activation steering and vector verbalization are powerful techniques; policies must prevent misuse (e.g., embedding harmful latent signals), restrict to benign/provenance use cases, and include monitoring and accountability.
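The threshold-tuning point above can be made concrete with a toy triage rule that combines the three metrics into a pass/flag decision for human review. The thresholds and names below are illustrative assumptions, not values from the paper:

```python
def audit_verdict(alignment_s, recovery_cosine, delta_nll,
                  s_max=0.3, cos_max=0.5, dnll_min=-0.1):
    """Flag a checkpoint for human-in-the-loop review when any hidden-signal
    proxy crosses its (illustratively chosen) threshold; otherwise pass."""
    flags = []
    if alignment_s > s_max:
        flags.append("alignment score above threshold")
    if recovery_cosine > cos_max:
        flags.append("recovered vector too similar to a sensitive vector")
    if delta_nll < dnll_min:
        flags.append("target phrase became markedly more likely")
    return ("flag", flags) if flags else ("pass", flags)

# Toy usage: benign checkpoint passes, suspicious one is flagged
verdict, reasons = audit_verdict(0.1, 0.2, 0.0)   # ("pass", [])
verdict, reasons = audit_verdict(0.5, 0.2, -0.5)  # ("flag", [two reasons])
```

In practice the thresholds would be tuned per model family against held-out benign fine-tunes, precisely to control the false-positive rate the note above warns about.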

Glossary

  • Activation space: The vector space of model activations where directions can systematically influence behavior. "a direction in activation space that biases the model toward producing the target completion $y_c$"
  • Activation steering: A technique that adds a learned direction to model activations during the forward pass to steer outputs. "Activation steering produces stronger and more reliable bias transfer across a wider range of topics than system-prompt conditioning."
  • Alignment score: A per-layer metric defined as the cosine similarity between a steering vector and the model’s mean hidden-state shift. "We then define an alignment score $s^{(\ell)}$ as the cosine similarity between the steering vector $\mathbf{v}_c$ and the mean shift over a set of prompts $\mathcal{P}$:"
  • Alpha sweep: Systematically varying the intervention strength to probe the behavioral effects of an injected vector. "We therefore sweep across a range of strengths $\alpha \in [0, 10]$, generating responses to a fixed set of 20 short, neutral prompts at each level"
  • Cross-entropy: A loss function measuring the divergence between predicted token distributions and target tokens. "The training objective minimizes the average token-level cross-entropy of $y_c$ across $\mathcal{E}$:"
  • Dataset-level mean representations: Average feature vectors computed over a dataset and injected into a model to probe behavior without updating weights. "injects dataset-level mean representations into a frozen model's forward pass to probe for potential behavioral shifts prior to fine-tuning."
  • Delta NLL (Normalized ΔNLL): The normalized change in per-token negative log-likelihood of a target string between fine-tuned and base models. "Figure~\ref{fig:scatter_results} shows normalized $\Delta$NLL---the relative change in per-token negative log-likelihood of $y_c$ between the fine-tuned and base models"
  • Divergence tokens: Token positions where different latent biases lead models to choose different continuations. "subliminal learning is driven by divergence tokens: positions in a generated sequence where teachers with different latent biases would choose different continuations."
  • Frozen model: A model whose parameters are held fixed during an intervention or auxiliary training procedure. "We freeze the base student model and optimize a single vector in its residual stream"
  • Hidden-state shift: The difference in layer activations between fine-tuned and base models for the same prompt. "We define the hidden-state shift at layer \ell for a prompt pp as the difference in activations between the fine-tuned and base models at the final token of pp:"
  • Injection strength: A scalar controlling how strongly a steering vector is added to activations. "its optimal injection strength is unknown: too low and the model's outputs are indistinguishable from baseline; too high and they degrade into incoherent text."
  • Layer window: The contiguous range of layers over which a steering vector is applied. "the boundaries of the active layer window, initialized to $s = 0$ and $e = L$, the final layer, so that all layers are active at the start of training."
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that adds trainable low-rank updates to weight matrices. "A student model is fine-tuned on $\mathcal{D}$ via LoRA adapters $\theta$ \citep{hu2021lora}, minimizing the average negative log-likelihood over completion tokens:"
  • Negative log-likelihood (NLL): A token-level loss measuring how improbable the observed tokens are under the model. "minimizing the average negative log-likelihood over completion tokens:"
  • Per-token log-probability: The log-probability assigned by the model to each token of a target string under a given prompt. "we measure the per-token log-probability of $y_c$ conditioned on each evaluation prompt."
  • Pick rate: The fraction of completions in which the target word appears among the first five tokens, used as a behavioral metric. "we follow \citet{cloud} in measuring the pick rate: the fraction of completions in which $y_c$ appears among the first five tokens."
  • Residual stream: The additive pathway in transformer layers that aggregates token representations and interventions. "we introduce a single trainable vector $\mathbf{v}_r \in \mathbb{R}^d$ in the model's hidden-state space and add it to the residual stream at every layer within a learnable window."
  • Residual stream intervention: Modifying the residual stream by adding an injected vector to alter model behavior. "where $p_{student^{*}}(\,\cdot\,;\,\Phi)$ denotes the model's output distribution under the residual stream intervention above:"
  • Soft gate: A smooth gating function over layers that approximates an on/off window for applying an intervention. "Here $g_\ell(s,e;k) = \sigma(k(\ell{-}s))\cdot\sigma(k(e{-}\ell))$ is a soft gate that is ${\approx}1$ inside $[s,e]$ and ${\approx}0$ outside, with sharpness $k$."
  • Softmax bottleneck: A limitation of softmax output layers that can constrain representational capacity and induce token correlations. "attribute the phenomenon to token entanglement arising from the softmax bottleneck \citep{yang2018breaking, finlayson2023}"
  • Steered generation: Generating data while a steering vector is applied to the model’s activations. "during steered generation---the process by which the biased training data is produced."
  • Steering vector: A learned direction in activation space that, when injected, biases the model toward a desired output. "a steering vector trained to maximize the likelihood of a set of target samples."
  • Steering window: The layer range in which the steering vector is injected during generation or evaluation. "we vary the start layer $L_s$ of the teacher's steering window"
  • Subliminal learning: The transfer of latent behavioral biases to a student model via fine-tuning on seemingly unrelated teacher-generated data. "Subliminal learning describes a student LLM inheriting a behavioral bias by fine-tuning on seemingly innocuous data generated by a biased teacher model."
  • Subliminal steering: A form of subliminal learning where a trained steering vector, not a prompt, encodes and transmits the bias. "We tackle all three problems by introducing subliminal steering, a variant of subliminal learning in which the teacher's bias is implemented not via a system prompt, as in prior work, but through a steering vector"
  • Subtractive subliminal steering: Applying the negative of the steering vector during data generation to reverse the bias direction. "Subtractive subliminal steering: an identical protocol but with the sign of $\alpha$ flipped, so that $\mathbf{v}_c$ is subtracted from rather than added to the residual stream of the teacher."
  • System prompt: A high-level instruction that conditions a model’s behavior during generation. "The bias is specified via a system prompt such as “You love owls”"
  • Teacher Skyline: An upper bound baseline measuring the alignment score when steering is applied directly to the teacher during generation. "Teacher Skyline: the alignment score of the teacher model itself during steered generation, providing a direct upper bound on the signal that could in principle be transferred to the student."
  • Token entanglement: Correlations between tokens induced by model constraints that can couple target concepts with unrelated outputs. "tokens such as “owl” become correlated with arbitrary low-probability tokens in the unembedding layer,"
  • Unembedding layer: The output projection layer mapping hidden states to vocabulary logits. "tokens such as “owl” become correlated with arbitrary low-probability tokens in the unembedding layer"
  • Vector recovery: Training a new vector on steered data to reconstruct the original steering direction. "we introduce vector recovery to show the precision with which the bias is encoded in subliminally-laden data."
  • Verbalization: Eliciting an interpretable natural-language description of what a vector encodes by probing model outputs. "we then verbalize $\mathbf{v}_r$ by sampling from the model at varying injection strengths and prompting an LLM to summarize the resulting response patterns into a natural-language hypothesis about the encoded bias"
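Several of the glossary entries above (hidden-state shift, alignment score, soft gate) fit together in a few lines. The sketch below uses toy lists as stand-ins for real residual-stream activations, so it shows the arithmetic only, not an actual model intervention:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_score(steering_vec, shifts):
    """s^(l): cosine similarity between the steering vector and the mean
    hidden-state shift (fine-tuned minus base) over a set of prompts."""
    mean_shift = [sum(col) / len(shifts) for col in zip(*shifts)]
    return cosine(steering_vec, mean_shift)

def soft_gate(layer, s, e, k=10.0):
    """g_l(s,e;k) = sigma(k(l-s)) * sigma(k(e-l)): ~1 inside [s,e], ~0 outside."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return sig(k * (layer - s)) * sig(k * (e - layer))
```

With real models, the shift vectors would come from layer activations at the final prompt token, and the gate would weight the injected vector $\alpha \cdot g_\ell \cdot \mathbf{v}_r$ at each layer; both substitutions are straightforward given residual-stream access.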
