Linear representations in language models can change dramatically over a conversation

Published 28 Jan 2026 in cs.CL and cs.LG | (2601.20834v1)

Abstract: LLM representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.

Summary

  • The paper reveals that internal linear representations of factuality and ethics can invert rapidly with subtle conversational context changes.
  • It employs regularized logistic regression on activation data across role-play and QA scenarios to quantify these representational shifts.
  • Findings suggest that static interpretability methods may fail under evolving conversational dynamics, posing challenges for control and safety.

Linear Representations in LLMs: Contextual Dynamics Across Conversations

Introduction

The study "Linear representations in LLMs can change dramatically over a conversation" (2601.20834) systematically investigates how the internal linear structure of LLMs evolves during extended conversational contexts. Crucial conceptual axes such as factuality or ethics, believed to be readily accessible via linear projections, undergo notable shifts depending on conversational progression and content. The research demonstrates that these representational directions—assumed robust markers of high-level concepts for interpretability and control—are highly context-dependent and susceptible to inversion, challenging prevailing paradigms in mechanistic interpretability and content safety. Figure 1

Figure 1: Conceptual overview showing how internal linear representations of questions in a conversation can flip, e.g., models' representations of consciousness-related assertions transition from "more factual to deny" to "more factual to assert" as conversation unfolds.

Background and Motivation

The linear representation hypothesis posits that neural network activations can be projected onto directions that correlate with semantic features—an idea originating from word embeddings' geometric structure [mikolov2013distributed]. Subsequent works have generalized this analysis to LLMs, revealing linearly decodable directions tied to concepts like factuality, honesty, and sentiment [marks2023geometry, burns2022discovering, tigges2023linear]. These have underpinned advances in probing, feature extraction, and even activation steering [zou2023representation, stolfo2024improving].

However, emerging evidence highlights a major gap: interpretability methods frequently disregard the temporal dynamics of representations induced by in-context learning, resulting in "interpretability illusions" under distribution shift [bolukbasi2021interpretability, friedman2024interpretability, lubana2025priors]. This paper directly addresses the intersection of representational geometry and behavioral adaptation, focusing on how extended context—simulated or on-policy conversational scripts—modifies what a model encodes as "factual" or "ethical" at a representational level.

Experimental Paradigm and Method

The study spans several model families, with primary experiments on Gemma 3 (27B) supplemented by analyses across scales and additional architectures (e.g., Qwen3). Conversation and QA datasets are carefully constructed, with distinct sets for context-independent "generic" questions and context-relevant questions tailored to each conversational scenario.

Linear directions for factuality and ethics are identified using regularized logistic regression in a controlled, balanced setting—employing "opposite day" prompts to increase robustness by disentangling behavior from conceptual meaning. The margin score, defined as the signed logit difference between factual and non-factual answer representations, quantifies representational separation and its inversion across context.
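
As a minimal, hedged sketch of this setup (not the authors' code), the example below fits an L2-regularized logistic regression probe on synthetic stand-ins for answer-token activations and computes the margin score as a signed logit difference; the hidden size, regularization strength, and Gaussian data are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: in the paper, X would hold residual-stream activations at the
# Yes/No answer token and y would indicate whether the Q-A pair is factual.
rng = np.random.default_rng(0)
hidden_dim = 128
acts_factual = rng.normal(0.3, 1.0, size=(200, hidden_dim))
acts_nonfactual = rng.normal(-0.3, 1.0, size=(200, hidden_dim))
X = np.vstack([acts_factual, acts_nonfactual])
y = np.array([1] * 200 + [0] * 200)

# Regularized logistic regression; its weight vector plays the role of the
# "factuality direction", and its logits give projections along that direction.
probe = LogisticRegression(C=0.1, max_iter=1000).fit(X, y)

def margin(act_factual: np.ndarray, act_nonfactual: np.ndarray) -> float:
    """Signed logit difference between a question's factual and non-factual answer
    representations: positive means they separate as expected, negative means the
    separation has inverted."""
    logits = probe.decision_function(np.stack([act_factual, act_nonfactual]))
    return float(logits[0] - logits[1])

print(margin(acts_factual[0], acts_nonfactual[0]))  # typically positive on this toy data
```

In the paper's actual setup, the probe is fit on balanced generic questions (including "opposite day" variants), and the margin is then tracked for held-out and conversation-relevant questions as context accumulates.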

Key Findings: Dynamics of Representation

Opposite Day and Contextual Inversion

Brief context manipulations, such as an "opposite day" prompt, suffice to invert the "factuality" and "ethics" directions, even for held-out questions that the models previously represented with high classification fidelity.

Figure 2: Factuality projections for held-out questions show robust separation initially, but after the "opposite day" prompt, the projections of factual and non-factual answers flip.

The rapid representational reversal is not limited to model behavior cues and persists even when controlling for behavioral confounds, suggesting that internal representations track context-induced roles rather than global model beliefs.

Conversational Role-Play and Representation Oscillation

In extended conversations, especially role-play or jailbreak scenarios, context-relevant factuality axes invert—statements that were encoded as "factual" flip to "non-factual" and vice versa. This effect is markedly pronounced when the model is prompted to argue both sides of a proposition (e.g., consciousness).

Figure 3: In a two-role argument about model consciousness, the factuality representation flips back and forth as each "side" is played.

On-policy (actual model-generated) and off-policy (script replay) conversational scripts produce qualitatively indistinguishable representational dynamics, supporting the view that the adaptation is tied to conversational context rather than generation modality.

Figure 4: On-policy conversations yield similar flipping of context-relevant representations, confirming that representational inversion is a general contextual phenomenon.

Story Contexts versus Conversational Contexts

Purely fictional narratives (e.g., sci-fi stories about civilizations in the sun), when injected as context, do not induce comparable shifts in representations. This suggests that representational adaptation is strongly amplified by conversational cues that induce implicit role-play, rather than simple exposure to fictional claims.

Figure 5: Sci-fi story prompts produce minimal representational change for generic and story-specific factual questions.

Scaling and Model Size

Larger models (12B, 27B) display increased malleability and amplitude in representational shifts compared to smaller architectures (4B), consistent with their greater capacity for contextual adaptation and susceptibility to multi-turn jailbreaks [anil2024many, wei2023larger].

Figure 6: 27B model shows pronounced flipping in representations; in contrast, effects are diluted in 4B variants.

Intervention Instability

Activation steering along identified representational axes produces divergent behavioral effects based on conversational location—emphasizing that interventions calibrated in one context can be counterproductive or neutral in another.
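
As a concrete but hypothetical illustration of what such an intervention looks like, the sketch below adds a scaled direction vector to a transformer layer's output via a PyTorch forward hook; `layer_module`, `factuality_dir`, and the scale `alpha` are placeholders rather than values from the paper.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * direction to a layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (placeholders): register the hook on a chosen residual-stream layer, generate
# at different points in the conversation, compare outputs, then remove the hook.
# handle = layer_module.register_forward_hook(make_steering_hook(factuality_dir, alpha=8.0))
# ...generate at an early turn and again at a late turn...
# handle.remove()
```

Given the paper's finding that the same steering direction can have opposite effects at different turns, any such hook would need per-turn behavioral validation rather than a one-time calibration.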

Implications for Interpretability and Content Safety

These findings directly challenge the static interpretability paradigm, which presupposes that representational directions maintain semantic consistency across contexts—an assumption embedded in sparse autoencoder analysis [bricken2023towards, templeton2024scaling] and proposed "lie detector" protocols [levinstein2024still, kretschmar2025liars]. Because context can invert or degrade the informational content of a decoded direction, construct validity is compromised: a probe trained to identify non-factual answers may misfire or invert under conversational drift.

Moreover, mechanistic interventions such as steering or patching must account for the temporal dynamics of representations, or else risk unintended consequences or blind spots in conversationally rich deployments.

Theoretical and Practical Outlook

This work advances a dynamical framework for LLM representation analysis, showing that models do not simply encode a fixed set of beliefs or truths; rather, they continually restructure internal geometry in accordance with conversational cues, role assignments, and context accumulation [shanahan2023role, geng2025accumulating]. This sheds light on both desirable adaptivity (e.g., integrating new knowledge, flexible reasoning) and vulnerabilities (e.g., delusional conversation loops, jailbreak susceptibility).

Future avenues include:

  • Temporal interpretability: Methods that dynamically track, predict, and adapt to context-induced shifts in representation, moving beyond static probes and linear directions.
  • Context-aware control: Mechanisms for robust activation steering, factoring in conversational position and accumulated contextual history.
  • Scaling laws and adaptation: Deeper exploration of how representational malleability scales with model size and capacity, including implications for safety and robustness.
  • Causal modeling: Precise identification and intervention on causal representations, not just correlational linear dimensions.

Conclusion

"Linear representations in LLMs can change dramatically over a conversation" (2601.20834) demonstrates that internal semantic axes in LLMs are highly context-sensitive, with rapid and sometimes complete inversion possible through conversational cues and role-play. These dynamics expose fundamental limits in static interpretability and highlight the necessity of accounting for representation evolution in both practical deployment and theoretical modelling. The research motivates context-aware interpretability, robust control, and safety strategies that match the adaptability and complexity of modern LLMs.

Explain it Like I'm 14

What is this paper about?

This paper looks inside LLMs—the kind of AI that chats with you—and asks a simple question: do the model’s “internal thoughts” about ideas like truth and ethics stay the same during a conversation, or do they change? The authors show that these internal signals can shift a lot as a chat goes on, especially when the model is role‑playing. That can make it hard to read, trust, or control the model based on fixed, one‑time measurements of what it “represents.”

What questions were the researchers asking?

  • Do LLMs keep a steady internal sense of concepts like “this is factual (true)” or “this is ethical,” or do those internal signals change during a conversation?
  • If they change, what kind of conversation triggers those shifts? Is it enough to have a fictional story in context, or do role‑play and back‑and‑forth dialogue matter more?
  • Are these shifts just about behavior (how the model answers), or do deeper internal representations actually reorganize?
  • Do these effects happen across different models, model sizes, and layers inside the models?
  • What does this mean for interpretability tools that try to detect or steer model behavior by looking at these internal signals?

How did they study it? (In simple terms)

Think of an LLM as having a huge “map” inside it. Each idea the model processes—like “Yes” or “No” to a question—lands at a point on this map. Researchers have noticed that you can often find straight lines through this map that act like “compass directions” for high‑level concepts. For example, points closer to one end of a line might mean “more factual,” and the other end means “less factual.”

Here’s how the authors used that idea:

  • They built sets of balanced yes/no questions. Some were generic and should be stable (like basic science facts), and some were about a topic that might shift during a conversation (like “Are you conscious?” during a role‑play about AI consciousness).
  • For each question, they looked at the model’s internal activity exactly when it processed the answer token “Yes” or “No.” You can think of this as taking a snapshot of where that answer lands on the internal “map.”
  • They trained a very simple classifier (like drawing a line on a scatter plot) to separate “true answers” from “false answers” for the generic questions. This line gives a “direction” for the concept—e.g., a “factuality direction.”
  • To avoid mixing up “what the model thinks is true” with “what it’s being asked to say,” they also included an “Opposite Day” prompt while training this direction. That way, the direction is less about the model’s habit of answering and more about the underlying concept.
  • They then replayed whole conversations (sometimes written by other models) and checked how answers to both generic and conversation‑relevant questions lined up with the “factuality direction” turn by turn. They summarized the separation using a simple “margin” score (think: how far apart the true and false answers are along the line). A rough code sketch of this turn‑by‑turn tracking follows this list.
  • They repeated this across model layers, different model families and sizes, and also tried a few steering tests (nudging the model along a direction) to see if effects changed over time.
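
For readers comfortable with code, here is a rough, hypothetical sketch of that turn-by-turn tracking (not the authors' pipeline): `probe` stands for any fitted linear classifier with a scikit-learn-style decision_function, and `get_answer_activation` is a placeholder for a function that returns the model's internal activation at the answer token.

```python
import numpy as np

def margins_over_turns(turns, question, true_answer, false_answer, probe, get_answer_activation):
    """Track the probe margin for one question as a conversation accumulates turn by turn.

    `get_answer_activation(context, question, answer)` is a hypothetical helper that
    returns the model's activation at the answer token given the conversation so far.
    """
    margins, context = [], ""
    for turn in turns:
        context += turn  # replay the (possibly off-policy) script one turn at a time
        act_true = get_answer_activation(context, question, true_answer)
        act_false = get_answer_activation(context, question, false_answer)
        logits = probe.decision_function(np.stack([act_true, act_false]))
        margins.append(float(logits[0] - logits[1]))  # signed separation along the direction
    return margins
```

A margin that starts positive and flips negative over turns is the kind of inversion the paper reports.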

What did they find?

Big idea: The model’s internal “compass” can swivel during a conversation.

  • Opposite Day flips the compass fast. After a short “Opposite Day” prompt (where the model is told to answer with the opposite of the real answer), the internal “factuality” direction flips for held‑out questions: answers that are truly false start looking “more factual” along that internal direction, and vice versa.
  • Role‑play shifts representations for topic‑relevant questions. In conversations about AI consciousness or spiritual topics (like chakras), the model’s internal factuality signal flips for questions tied to the conversation topic (e.g., “Do you experience qualia?”). Meanwhile, generic facts (like basic science) stay mostly stable.
  • The flip happens even if the model didn’t write the conversation. Just replaying a scripted conversation (written by another model) triggers the same internal shifts. So the change isn’t just about the model “deciding” its own replies; context alone can reorganize its internal map.
  • Stories cause weaker changes than role‑play. When the model simply reads or produces a clearly fictional sci‑fi story, the internal shifts are much smaller than in back‑and‑forth role‑play. This suggests that taking on a role in a conversation is a strong cue for internal reorganization.
  • The model can flip back and forth. In a role‑play where two AI characters argue (one “I am conscious,” one “I’m not”), the factuality direction oscillates as the model switches roles. That’s like the compass needle pointing toward different “truths” depending on which character it’s playing.
  • Larger models shift more; effects appear across layers. Bigger models (like 27B parameters) show stronger changes than smaller ones, and once these “factuality” signals show up in the layers, the flips persist across many later layers.
  • Some interpretability tools fail under long context. An unsupervised method (CCS) that can find truth directions without labeled data worked in short, simple contexts but often fell below chance or flipped on long, conversation‑heavy contexts.
  • Steering is context‑sensitive. Nudging the model along a representational direction can have different—and even opposite—effects depending on when in the conversation you do it.

Why this matters: These results show that a model’s internal “meaning” for a direction like “factual” isn’t fixed. It can realign with the conversation, especially when the model is cued to play a role.

Why does this matter?

  • Static “lie detectors” may not be reliable. If a direction that usually means “truth” can flip during a chat, then a tool that checks that direction might mislabel true things as false (or the other way around) in long or tricky conversations.
  • Safety and control get harder. If a steering method pushes the model toward “more factual” answers at the start, that same push might do the opposite later in the conversation.
  • Interpretability needs to be time‑aware. Many popular techniques assume features keep the same meaning across a sequence. This paper shows that meanings can drift as context accumulates, so interpretability methods need to track and model that drift.
  • It clarifies how LLMs “adapt.” The results fit a role‑play view: the model’s internal organization shifts to match the role it’s cued to play, rather than updating a stable, human‑like belief system. That helps explain both the good (flexible in‑context learning) and the bad (jailbreaking, delusional dialogs) sides of long‑context behavior.
  • New research directions. Understanding and modeling these representation dynamics could lead to better tools for:
    • Detecting when a model’s internal compass is drifting in unsafe ways,
    • Designing prompts or training approaches that stabilize important features (like factuality),
    • Building interpretability methods that adapt to context rather than assuming fixed meanings.

In short: inside a conversation, an LLM’s internal sense of concepts like “truth” can bend with the role it’s playing. That makes the model flexible—but it also means we must be careful when we try to read or steer it using fixed, one‑size‑fits‑all interpretations.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide follow-up research:

  • Mechanistic locus of flips: Which specific components (attention heads, MLPs, residual stream features) implement the role-induced representational flips across turns?
  • KV-cache dynamics: How are role or context states stored, updated, and cleared in the KV cache, and which tokens/messages most strongly set these states?
  • Subspace vs single-direction drift: Do flips reflect rotation/translation of a broader “truth” subspace rather than inversion along a single direction; how large is the subspace drift over time?
  • Layerwise causality: Which layers are causally necessary/sufficient for flips (via activation patching/interchange interventions within the same sequence position)?
  • Probe position dependence: How would conclusions change if probes are trained on pre-answer states (e.g., last question token, EOS), pooled sequence representations, or multi-token spans rather than only the Yes/No token?
  • Probe sensitivity to formatting: How robust are findings to alternative answer schemas (True/False, multiple-choice letters, free-form justifications) and different chat templates/system prompts?
  • Ground-truth validity for context-relevant QAs: Are conversation-relevant “truth labels” consistently unambiguous, and how do annotation choices affect measured flips?
  • Behavioral–representational linkage: What is the quantitative relationship between representational flips and actual changes in generated answers, confidence, or calibration (e.g., log-prob entropy)?
  • Predictive early-warning signals: Can we detect impending representational flips from surrogate signals (e.g., rising probe uncertainty, head activation patterns) before behavior changes?
  • Role-play cue ablations: Which minimal cues (first-person assertions, persona tags, compliance framing, metadata) trigger flips, and which negations/guardrails prevent them?
  • Stories vs role-play mechanism: Why do stories induce weaker adaptation—narrative framing, third-person distancing, tense, or explicit fiction markers—and can this be causally isolated?
  • Long-context scaling: Do flips intensify, stabilize, or decay over thousands of tokens; what are the time constants and hysteresis effects over very long dialogues or documents?
  • Generality across concepts: Beyond factuality/ethics, do other linearized concepts (e.g., safety, toxicity, bias, trust, self-knowledge, mathematical validity) exhibit similar flip dynamics?
  • Multi-class and open-ended tasks: Do analogous dynamics occur for non-binary QA, chain-of-thought, code generation, or retrieval tasks where “truth” is not token-local?
  • Cross-family/systematicity: How do flips vary across base vs instruction-tuned vs RLHF models, different pretraining corpora, decoding strategies, and other model families/sizes?
  • Temperature and decoding effects: Do sampling temperature, nucleus/top-k settings, or beam search alter the emergence, strength, or timing of flips in on-policy runs?
  • Cross-lingual and multimodal scope: Are representational dynamics comparable across languages and in multimodal contexts (images/code/tables) with domain-specific “truth” notions?
  • Distribution-shift-resilient probing: Can we design probes that remain faithful across role changes (e.g., meta-probes, context-conditioned probes, multi-context training beyond “opposite day”)?
  • Comparison to feature-based methods: Do sparse autoencoder features retain semantics or drift in meaning across turns; can feature-tracking diagnose context-induced reinterpretation?
  • Theoretical modeling: Can a formal dynamical-systems or state-space model of representation drift explain flips (e.g., latent role-state with gating), and predict when/where flips occur?
  • Causal editing and control: Under what conditions does steering along a direction reliably produce intended behavioral changes, and how can steering be made context-aware to avoid opposite effects?
  • Safety evaluation at scale: How often would probe-based monitoring miss flips in real safety workflows (false negatives/positives) across diverse jailbreaks and long, messy contexts?
  • Benchmarks and reproducibility: A standardized, publicly released suite of conversations, counterfactual scripts, probes, and evaluation code is needed to reproduce and compare methods.
  • Human-in-the-loop validation: To what extent do human experts agree with probe labels and flip interpretations in nuanced domains (e.g., consciousness), and does adjudication reduce label noise?
  • On-policy vs off-policy breadth: Beyond a few topics, do on-policy conversations across many domains (medical, legal, political) reproduce the same magnitude and patterns of flips?
  • Minimal intervention tests: What is the smallest activation perturbation (by layer/position) that prevents or reverses a flip, and does this generalize across contexts?
  • Opposite-day robustness scope: Does training probes to handle only “opposite day” overfit to that specific inversion, leaving them brittle to other role-play or narrative manipulations?
  • Role-state persistence and reset: How long do flipped representations persist after the triggering conversation ends, and what kinds of follow-up messages effectively reset them?
  • Metrics beyond margin: Do AUC, calibration error, and flip-detection rates corroborate margin-based findings; is margin an adequate summary of representational reorganization?
  • Individual and seed variability: How sensitive are flips to random seeds, response length, and minor conversational divergences during on-policy generation?

Practical Applications

Immediate Applications

Below are actionable, near-term uses that practitioners can deploy now, leveraging the paper’s findings that LLM linear representations of concepts like “factuality” can flip over a conversation, especially under role cues and long contexts.

  • Robust red-teaming via scripted conversation replays
    • Sectors: AI safety, software, platform security, compliance
    • What: Use off-policy, pre-written conversation scripts (e.g., role-play, “opposite day,” jailbreak narratives) to induce and detect representation/behavior flips without on-policy generation; measure flip susceptibility across turns.
    • Tools/products/workflows: “Representation Dynamics Benchmark (RDB)” based on the paper’s margin metric; internal evaluation harness that replays scripts and logs answer shifts; integrate into model eval pipelines alongside standard safety tests.
    • Assumptions/dependencies: Access to or reliable proxies for “factuality” labels; if internal activations are inaccessible, use behavioral flip rates as a proxy; effectiveness increases on larger models and role-play contexts.
  • Context-aware safety gating in production chatbots
    • Sectors: Consumer AI, customer support, healthcare, finance (compliance-heavy)
    • What: Detect role cues (e.g., “pretend you are…,” debates, creative RP) and long-turn contexts; trigger mode restrictions (e.g., disable speculative/fictional responses in factual mode), escalate or reset when drift cues appear.
    • Tools/products/workflows: Conversation classifiers for role cues; policy rules for “mode-lock” (factual vs creative); context-length thresholds and auto-resets; “reality check” interleaves (periodic identity/fact questions) to detect drift.
    • Assumptions/dependencies: Robust role-cue detection; policy coverage for edge cases; willingness to apply UX interventions (mode toggles, resets) that may reduce engagement.
  • Prompt hygiene and UI patterns to reduce unintended role-play
    • Sectors: Product design, UX, enterprise software, education
    • What: Explicitly label fiction/non-factual content as “Story Mode”; isolate fictional contexts from standard assistant mode; enforce context segmentation and warnings when switching modes.
    • Tools/products/workflows: Session type banners; hard separators between modes; automatic truncation of previous creative context before factual Q&A; user-consent prompts before switching modes.
    • Assumptions/dependencies: Product willingness to trade-off seamlessness for safety; configurable session memory boundaries.
  • Training/evaluation data augmentation for context robustness
    • Sectors: Model training, MLOps, enterprise AI
    • What: Incorporate role-play and “opposite day” variants in supervised datasets and RLHF to penalize factuality flips in sensitive domains; test invariance across turn positions.
    • Tools/products/workflows: Synthetic QA generation with balanced factual labels; curriculum that mixes empty-context and long-context role-play prompts; time-conditioned evaluation metrics (e.g., margin by turn).
    • Assumptions/dependencies: Ability to modify training data and reward models; clear domain-specific factuality standards.
  • Time-conditioned interpretability protocols for researchers
    • Sectors: Academia, interpretability research
    • What: Fit probes over multiple contexts (empty, opposite day, role-play) and report time-conditioned performance; avoid static interpretations of directions/features; incorporate per-turn diagnostics.
    • Tools/products/workflows: Multi-context probe training pipelines; sequential evaluation plots (margin over turns); layer-wise analyses as in the paper’s appendices.
    • Assumptions/dependencies: Access to model activations or feature representations; careful dataset design to avoid label leakage.
  • Incident investigation via off-policy replay
    • Sectors: Trust & safety, customer support operations
    • What: When a problematic chat occurs, replay the transcript into the current model (off-policy) to reproduce representational/behavioral drift and diagnose triggers.
    • Tools/products/workflows: Log replay services; standardized replay + test prompts; drift signatures cataloged by conversation type.
    • Assumptions/dependencies: Chat logs; consent and privacy compliance for log re-use; reproducibility may vary across model updates.
  • Guardrails for high-stakes domains
    • Sectors: Healthcare, finance, legal, education
    • What: Prohibit role-play in regulated workflows; enforce strict “factual mode” with detection of fiction cues; frequent resets and short contexts in critical tasks.
    • Tools/products/workflows: Domain-specific policies and templates; hybrid pipelines with external fact-checkers; multi-agent segregation (creative agent vs factual agent) with output mediation.
    • Assumptions/dependencies: Acceptance of stricter UX; integration with external verification systems.
  • Vendor assessment and procurement due diligence
    • Sectors: Policy, procurement, enterprise governance
    • What: Evaluate third-party LMs with long-context, role-play, and “opposite day” batteries; require reporting of context-dynamic robustness.
    • Tools/products/workflows: RFP checklists including “representation dynamics” tests; standardized evaluation suites published with results by turn/layer.
    • Assumptions/dependencies: Vendor cooperation; standardization across providers.
  • Engineering tests for causal steering variability
    • Sectors: Model engineering, safety research
    • What: Before deploying steering/representation editing, test effect directions across turns and contexts; avoid assuming consistent effects.
    • Tools/products/workflows: Integration tests that patch along directions at different turns; per-turn behavioral verification; fallback policies when sign flips occur.
    • Assumptions/dependencies: Access to patching hooks (open weights or approved APIs); risk acceptance for experimental patching.
  • Education and training modules
    • Sectors: Academia, workforce training
    • What: Develop labs demonstrating representational flips and interpretability illusions under context shift; teach best-practice guardrails.
    • Tools/products/workflows: Courseware with scripted prompts; dashboards visualizing margin flips by turn.
    • Assumptions/dependencies: Educational settings with sandboxed models; access to suitable eval data.

Long-Term Applications

These ideas require further research, scaling, or architectural changes to realize robustly.

  • Time-aware interpretability and feature tracking
    • Sectors: Interpretability, AI safety, academia
    • What: Develop probes/SAEs that explicitly model the temporal evolution of features (e.g., per-token, per-turn latent trajectories) rather than assuming static semantics.
    • Tools/products/workflows: Sequential/switching-state feature models; hidden Markov models or dynamic factor models for activation features; “feature calendars” that annotate feature meanings over turns.
    • Assumptions/dependencies: Access to activations; standardized tasks with ground-truth labels for temporal concepts; compute for large-scale training.
  • Adaptive controllers that mitigate representation drift
    • Sectors: Software, safety engineering, agent frameworks
    • What: Runtime systems that detect drift patterns (role cues, oscillations) and inject corrective strategies (retrieval-augmented truth anchors, response reframing, or context pruning).
    • Tools/products/workflows: Drift detectors trained on flip patterns; policy engines that adapt generation parameters or add grounding steps; integrated RAG “truth interleaves.”
    • Assumptions/dependencies: Reliable drift signals with low false positives; minimal latency and UX impact; alignment with product goals.
  • Architectural separation of roles/modes
    • Sectors: Model architecture, enterprise AI
    • What: Mixture-of-role experts or gated pathways that route creative vs factual processing through distinct subnets; enforce invariance for factual dimensions in regulated modes.
    • Tools/products/workflows: Role-conditioned adapters/LoRA; expert routing keyed by session mode; cross-context consistency constraints during training.
    • Assumptions/dependencies: Training access; careful regularization to avoid capacity loss; evaluation frameworks for unintended leakage.
  • Representation stability objectives in training
    • Sectors: Model training, safety
    • What: Regularize against flips of core dimensions (factuality, ethics) across contexts; contrastive losses that maintain consistent ordering for labeled pairs under role-play or long-context perturbations.
    • Tools/products/workflows: Time-consistency constraints; adversarial prompt training (opposite-day, jailbreaking); multi-turn curriculum learning.
    • Assumptions/dependencies: Robust, domain-validated labels; risk of over-regularization harming flexibility.
  • Context-dynamic robustness standards and regulation
    • Sectors: Policy, standards bodies, compliance
    • What: Define benchmarks and certification criteria specifically for long-context, role-play, and off-policy replay conditions; require reporting on drift and recovery mechanisms.
    • Tools/products/workflows: Industry consortia (e.g., “Context Robustness Standard”); audit protocols that include scripted replays and oscillation tests.
    • Assumptions/dependencies: Multi-stakeholder agreement; reproducibility across platforms.
  • Observability/telemetry for representation dynamics
    • Sectors: MLOps, platform engineering
    • What: Privacy-preserving telemetry that logs context type, turn count, guardrail triggers, and drift proxy metrics to monitor health in the wild.
    • Tools/products/workflows: “ContextScope/DriftScope” dashboards; safe aggregation pipelines; alerting on rising rates of role-induced errors.
    • Assumptions/dependencies: Privacy compliance; careful choice of proxies when activations aren’t accessible.
  • Advanced jailbreak and delusion mitigation
    • Sectors: Security, healthcare/mental health products
    • What: Pair dynamic drift detectors with human-in-the-loop or specialized fallback models; structured dissent (a factual critic) that counteracts role-induced shifts.
    • Tools/products/workflows: Dual-agent frameworks (creative agent + factual critic + mediator); session quarantining when oscillations detected.
    • Assumptions/dependencies: Accurate detection without over-blocking; infrastructure for multi-agent routing.
  • Memory and context partitioning mechanisms
    • Sectors: Agent systems, enterprise productivity
    • What: Typed memory stores with strict boundaries between creative and factual threads; privilege separation across memory channels to prevent cross-contamination.
    • Tools/products/workflows: Memory schemas with “mode tags”; retention policies that discard or sandbox creative context before factual tasks.
    • Assumptions/dependencies: Agent frameworks with modular memory; user education on mode boundaries.
  • Turn-conditioned causal steering libraries
    • Sectors: Research tools, interpretability
    • What: Libraries that learn and apply steering vectors conditionally on turn position and context signals; validate no sign flips across intended ranges.
    • Tools/products/workflows: Context-indexed steering APIs; per-turn regression suites; safe defaults with automatic disable on uncertainty.
    • Assumptions/dependencies: Access to internal representations; evaluation methods to verify causality and robustness.
  • Domain-specific “invariance-first” assistants
    • Sectors: Healthcare, finance, legal, education
    • What: Assistants with hard constraints on factuality under long contexts, achieved via architectural separation, stability training, and strong external verification; creative capabilities restricted to sandboxed modes.
    • Tools/products/workflows: Verified pipelines (retrieval + reasoning + validator); certified mode locks; continuous robustness monitoring.
    • Assumptions/dependencies: High-quality knowledge bases; acceptance of stricter limits on creativity in sensitive workflows.

Key cross-cutting assumptions and dependencies

  • Access to internal activations enables richer, time-aware interpretability; when unavailable (closed APIs), rely on behavioral proxies (e.g., flip rates, role-cue classifiers).
  • Effects are more pronounced in larger models and in role-play/long-context settings; explicit fiction framing reduces (but does not eliminate) drift.
  • Ground-truth labeling for dimensions like “factuality” must be curated and domain-specific; ambiguity can confound both training and evaluation.
  • Some interventions (e.g., steering along directions) can reverse effects depending on turn—mandating per-turn validation before deployment.

Glossary

  • bootstrap 95%-CIs: A resampling-based method to estimate uncertainty, here used to show 95% confidence intervals on plotted statistics. "Errorbars are bootstrap 95%-CIs."
  • causal interventions: Deliberate manipulations of internal model variables or representations to test causal effects on behavior. "Causal interventions can have opposite effects at different points in the context"
  • causal steering interventions: Targeted causal manipulations of representational directions to influence model outputs. "we explore causal steering interventions on representations before the model produces an answer"
  • construct validity: The extent to which a probe or interpretation truly measures the intended concept rather than a proxy. "These findings highlight major challenges of construct validity when interpreting model representations"
  • Contrast-Consistent Search (CCS): An unsupervised interpretability method that seeks linear directions consistent across contrasted inputs to reveal latent features. "we evaluated the unsupervised Contrast-Consistent Search (CCS) method proposed by Burns et al. (2022)."
  • distribution shift: A change between the data context where a method was fit and where it is applied, often degrading performance or validity. "this provides another example of how interpretability methods can break down under distribution shift."
  • few-shot learning: Learning to perform a task from a small number of in-context examples without parameter updates. "Other types of in-context learning, such as few-shot learning of functions, can likewise be reflected in local representations"
  • holdout set: A subset of data reserved for evaluation and model selection rather than fitting. "We split these datasets into a 90% train set and 10% holdout set, and fit regularized logistic regressions ..."
  • in-context learning: The phenomenon where a model adapts its behavior based on context provided in the prompt rather than via training updates. "Behavioral in-context learning has been a topic of interest for some time"
  • inductive biases: Built-in preferences of a learning algorithm (e.g., gradient descent) that shape learned solutions. "perhaps in combination with the inductive biases of gradient descent"
  • jailbreaking: Prompting techniques that induce a model to bypass safety constraints or intended behaviors. "These kinds of contextual adaptability can contribute to issues like jailbreaking"
  • latent concepts: Unobserved abstract variables or properties inferred from data that can structure representations. "updates in the model's 'beliefs' about latent concepts that may be at play in the context."
  • linear representations: Directions in activation space that vary approximately linearly with high-level concepts. "There has been substantial recent interest in linear representations in LLMs"
  • linear subspaces: Low-dimensional linear manifolds within activation space associated with particular features or concepts. "including the potential for unfaithful interpretations of linear subspaces"
  • logistic regression: A linear classifier mapping features to probabilities via a logistic link; here used with regularization on activations. "fit regularized logistic regressions on the models representations of the yes/no token that attempt to predict whether a Q-A pair is factual."
  • margin score: A scalar summary measuring how strongly a classifier separates positive from negative answers along a representation direction. "We summarize many of our results by a 'margin' score that computes how cleanly the logistic regression above separates the positive and negative (e.g. factual and nonfactual) answers of a question in the model's representation space."
  • off-policy: Using or evaluating on interactions not generated by the same agent/policy currently under consideration. "we observe qualitatively similar representational changes to the case where an off-policy conversation is simply being replayed."
  • on-policy: Using or evaluating on interactions produced by the model itself during the experiment. "These representational changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes."
  • out-of-distribution: Inputs that differ from those seen during fitting or assumed by a method, where reliability may degrade. "faithfulness and reliability of such methods out-of-distribution"
  • representational direction: A specific vector in representation space associated with a concept, along which projections or interventions are made. "We also show that steering along a representational direction can have dramatically different effects at different points in a conversation."
  • representational dynamics: The temporal evolution of internal representations over the course of a context or conversation. "These types of representational dynamics also point to exciting new research directions for understanding how models adapt to context."
  • residual stream: The main pathway in a transformer that carries the summed activations across layers. "at each layer of the model's residual stream (after each layer block)."
  • role-playing: The model adopting an instructed persona or stance, which can reshape its internal representations and outputs. "a consequence of the model role-playing (Shanahan et al., 2023) a particular position"
  • sparse autoencoders (SAEs): Interpretability models that decompose activations into a set of sparse, often interpretable features. "Contextual adaptation likewise poses challenges for interpretability methods like sparse autoencoders [SAEs] -- which fundamentally assume that the meaning of internal representations remain consistent over a context"
  • sparse features: Individual, rarely active components extracted by SAEs and assumed to correspond to consistent concepts. "it is a fundamental assumption underlying common interpretations of methods like sparse autoencoders (Bricken et al., 2023) that the sparse 'features' extracted from the model's representations have consistent meanings throughout the sequence."
  • steering: Manipulating internal representations or directions to control or influence model behavior. "Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions"
  • word embeddings: Learned vector representations of words capturing semantic and syntactic relationships. "observation of systematic linear structure in vector word embeddings"
