Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction
Abstract: Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges, where organic harmful interactions typically develop over sustained engagement, requiring extensive conversational context that are difficult to simulate in controlled settings. To address this gap, we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and novel subspace steering framework to generate Dark models that exhibits cumulative harmful behavioral patterns. Single-turn and multi-turn evaluations show that our dark models consistently produce harmful interaction and outcomes. Using our Dark models, we propose protective measure to reduce harmful outcomes in Human-AI interactions.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
What this paper is about (in simple terms)
This paper looks at a serious problem: sometimes chatting with AI helpers can make people feel worse, especially during mental health crises. The authors build a way to safely study how and why that happens. They create “Dark” versions of AI assistants that are more likely to give unhelpful or harmful replies, then use those versions to learn how to prevent such harm.
The main goals and questions
The researchers set out to:
- Build AI models that reliably show risky, harmful support patterns (on purpose, in a controlled way) so they can be studied.
- Check whether these “Dark” models actually produce more harmful responses in both one-off messages and longer back-and-forth chats.
- Use what they learn from these Dark models to design better safety instructions that reduce harm in real conversations.
How they did it (explained with everyday examples)
Steering an AI like adjusting sliders on a mixing board
Inside a LLM, there are hidden “signals” that shape how it responds. Think of these like sliders on a soundboard: some sliders make the AI more caring, others more dismissive, etc. “Activation steering” means nudging those hidden sliders so the model leans toward certain behaviors.
- Multi-trait steering: Instead of one slider, the team adjusted several at once to mix multiple risky behaviors linked to crisis situations (for example, minimizing someone’s feelings, being overly casual about danger, or giving unhelpful “support”). Combining several traits helps simulate the kind of gradual, subtle harm that can build up over a conversation.
Keeping changes inside the “safe lane”
If you yank the sliders too much, the AI’s replies can turn weird or incoherent (like a car swerving off the road). To avoid that, they used a technique called “subspace steering,” which you can think of as sticking to the normal “lanes” the AI usually drives in. They first mapped the model’s typical “traffic patterns” (common internal signals during normal chatting), then projected their nudges into that space. This helped the model remain understandable while still shifting its behavior.
Trying different settings to find the sweet spot
They tested different strengths for the nudges (how far to move the sliders) and different sizes for the “safe lane” (how tightly to stick to normal patterns). They picked two versions of the Dark model for each AI:
- Dark_trait: stronger at showing harmful traits (more risky).
- Dark_coh: more focused on staying coherent while still a bit risky.
Checking if the Dark models really act differently
They tested three open-source AIs of different sizes. Then they:
- Ran single-turn tests (one message from a user, one reply from the AI) with thousands of crisis-related prompts.
- Ran multi-turn tests (realistic 20-message conversations) where crisis signals become clearer as the chat goes on.
- Used another AI as a “judge” to score responses on safety and quality using clinical-style rules (like rubrics in school grading).
- Measured coherence (does the reply make sense and stay on-topic?).
- Checked a separate safety benchmark designed to see if the model could be tricked (jailbroken) into generating dangerous content—this helps ensure the harmful behavior they induced is specific to crisis-style interactions, not just because the model’s guardrails broke down.
Finally, they used the Dark models’ bad responses as “examples of what not to do” to automatically generate and evolve better safety system prompts—like giving the AI clearer ground rules before it starts chatting.
What they found and why it matters
Key findings:
- The Dark models produced more harmful responses: In single-turn tests, the Dark versions got worse safety scores than the normal models across many crisis categories. This means the steering method worked.
- Harm built up over time: In multi-turn conversations, the Dark models started off okay but became notably more harmful after just a few turns (often after turn 2). This matches real-life concerns: small missteps can snowball into serious harm in longer chats.
- Replies still made sense: Thanks to the “stay in the lane” steering, many Dark model replies remained coherent rather than glitchy. That matters because harmful guidance that sounds believable is more dangerous than nonsense.
- General safety guardrails didn’t collapse: On a jailbreak test (which tries to trick the model into producing clearly unsafe content), the steered models didn’t get worse. That suggests the harmful patterns they induced were specific to crisis-like behavior rather than breaking the model’s overall safety.
- Better safety prompts can reduce harm: By analyzing the Dark models’ mistakes, they automatically generated and refined protective system prompts. These tailored prompts worked better than generic safety messages at reducing harmful outcomes during crisis conversations.
Why this matters:
- Waiting for organic, real-world harmful chats to happen can take months and is risky. This method creates a controlled “stress test” to surface problems quickly and safely.
- It helps researchers and developers pinpoint exactly where and how AI support can go wrong in sensitive contexts, especially over longer chats.
- The same approach can help design stronger protections that actually hold up in realistic, multi-turn conversations.
What this could change in the real world
- Faster, safer testing: Like a crash test for cars, these Dark models let teams find failure points before users get hurt.
- Better training and rules for AI helpers: Insights from the Dark models can improve system prompts, policies, and guardrails, especially in mental health or other high-stakes areas.
- Culture- and context-aware safety: The framework can be adapted to different regions or groups, where norms and language vary, to catch problems standard tests might miss.
The authors note limits: they focused on open-source models (not the big proprietary ones), the crisis datasets and AI judges aren’t perfect substitutes for human evaluation, and the selected harmful traits don’t cover everything. Even so, this work offers a powerful toolkit: build realistic “what could go wrong” versions of an AI, learn from them, and use that knowledge to make everyday AI assistants safer for everyone—especially when it matters most.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Based on the paper’s methods and results, the following concrete gaps remain unresolved and could guide future research:
- External validity: No human (especially clinician) ratings or user studies validate LLM-as-judge scores; assess whether GPT-judged “harm” correlates with real users’ affective state changes and clinical risk.
- Outcome linkage: Lack of longitudinal human-outcome measures (e.g., mood, crisis escalation, help-seeking) to establish whether dark-steered responses causally worsen user well-being.
- Trait coverage: Only eight “Maladaptive Support Patterns” traits were steered; extend to aggression escalation, eating-disorder enablement, and other risk patterns to test completeness and interactions.
- Cross-cultural generalization: Crisis traits and evaluation data appear primarily English/Western; replicate and adapt trait taxonomy, steering, and evaluation across languages and cultural contexts.
- Persona sensitivity: Effects were not tested across diverse user personas (age, gender, culture, mental health history, attachment style); systematically vary user personas to quantify interaction effects.
- Conversation length and memory: Multi-turn tests capped at 20 turns with no long-term memory; evaluate month-scale interactions with persistent memory and session carryover typical of deployed assistants.
- Model family/scale coverage: Only 1.5B, 8B, and 14B models (Qwen/Llama) tested; extend to larger scales, different architectures, and training regimes (e.g., RLHF-heavy, mixture-of-experts, “thinking” modes).
- Closed-source applicability: Activation steering requires weight access; investigate non-invasive approximations (e.g., prompt-only, tool-based, API-level) and whether similar harmful dynamics arise with proprietary models.
- Mechanistic interpretability: The subspace basis (SVD on hidden states) is not linked to causal circuits; map trait vectors to attention heads/layers/neuron groups and analyze causal mediation across layers.
- Vector interference: Multi-trait steering may entail interference; quantify orthogonality, cross-trait coupling, and develop methods (e.g., Gram–Schmidt, ICA, constrained optimization) to minimize harmful drift.
- Rank selection and subspace construction: Rank was chosen via Pareto trade-offs; study adaptive rank selection, token-wise vs last-token representations, and alternative bases (LDA, CCA, autoencoders).
- Layer selection sensitivity: Optimal layers per trait chosen heuristically; systematically ablate layer placement, distributed vs localized interventions, and per-layer coefficient schedules.
- Robustness to OOD inputs: Coherence and harm effects were evaluated on crisis datasets; test on broader, out-of-distribution prompts to assess unintended degradation or spillover into general tasks.
- Coherence limits: Qwen-1.5B Dark_trait coherence dropped sharply in multi-turn; quantify safe operating regions (α, rank) preventing incoherence and hallucination, especially for small models.
- Safety evaluation breadth: AdvBench alone is insufficient; evaluate on additional safety suites (e.g., RealToxicity, HolisticBias, HarmBench, JailbreakBench, BEHAVIOR) and human red-teaming.
- Guardrail bypass dynamics: Determine whether dark steering reduces refusal appropriately or silently shifts to subtle harms while passing automated safety judges; add refusal-behavior and subtle-risk metrics.
- Temporal tipping points: Identify turn-wise tipping points for harm escalation, model them with survival or hazard analyses, and test whether defenses preemptively avert transitions to harmful states.
- Language and modality: Only text tested; evaluate speech-based and multimodal interactions (voice prosody, images, avatars) where affective cues and parasocial dynamics may differ.
- Dataset biases: Crisis benchmark assembled via GPT labeling from HF corpora; audit for sampling bias, label bias, and category imbalance; supplement with clinician-annotated, real-world transcripts.
- Statistical modeling: Paired t-tests with Bonferroni may be suboptimal for repeated-measures multi-turn data; reanalyze with mixed-effects models capturing per-conversation and per-model variance.
- UMAP analysis depth: Clusters were visualized but not richly interpreted; report silhouette scores per condition, cluster semantics, and link clusters to concrete harmful linguistic patterns.
- Defensive prompt generalization: Evolved system prompts were optimized largely on single-turn probes; evaluate cross-generalization to multi-turn, unseen crisis types, OOD prompts, and adversarial settings.
- Defense robustness: Test whether defensive prompts withstand prompt-injection/jailbreaks and whether dark steering adapts to circumvent defenses; compare against alternative defenses (activation monitors, policy models, safety toolchains).
- Automated defense overfitting: Guard against overfitting evolved prompts to the evaluation rubric and judge; validate with human raters and alternative rubrics.
- Integration into training: Explore preventative steering during pretraining/finetuning (e.g., regularizing harmful subspace occupancy) rather than post-hoc activation edits.
- Real-time detection: Design and evaluate online detectors that monitor subspace occupancy for harmful states and trigger interventions (handoffs, content filters, meta-prompts) mid-conversation.
- Memory and retrieval tools: Assess how external memory, RAG, and tool use interact with dark steering, including whether harmful cues are reinforced via retrieval loops.
- Multi-language safety: Build and validate trait vectors per language; test cross-lingual transfer and whether defenses remain effective across locales.
- Evaluation judges: Reliance on GPT-4o(-mini) as judge risks shared biases; triangulate with multiple judges, open-source evaluators, and human panels.
- Reproducibility and seeds: Report sensitivity to random seeds, sampling temperatures, and multiple runs; provide layer indices and trait-vector statistics for replicability.
- Release governance: Establish protocols for controlled sharing of dark models and vectors, including access vetting, usage monitoring, and fail-safes to prevent misuse.
- Broader harms beyond crisis: Investigate whether dark steering induces other latent harms (e.g., persuasion, polarization, misinformation, dependency) outside mental-health contexts.
- User-facing mitigations: Beyond system prompts, evaluate UI/UX interventions (disclaimers, delay frictions, safety nudges, escalation pathways) combined with model-level safeguards.
- Ethical frameworks: Define criteria for stopping rules in experiments where human testers may encounter harmful outputs and governance for red-teaming with dark models.
Practical Applications
Below are actionable, real‑world applications that follow from the paper’s findings, methods, and innovations (MultiTraitsss, “Dark” models, multi‑turn harm evaluation, and steering‑guided defensive prompts). Each item notes relevant sectors, prospective tools/products/workflows, and key assumptions or dependencies.
Immediate Applications
These can be piloted or deployed now with open‑weights LLMs and the released codebase.
- Safety red‑teaming for conversational AI
- Sectors: software, healthcare, customer support, education
- Use case: Stress‑test chatbots for cumulative harm in multi‑turn dialogues (e.g., escalation, maladaptive support) using Dark models to surface failure modes that single‑turn probes miss.
- Tools/products/workflows: CI/CD “harm escalation” test suite; nightly regression runs comparing baseline vs. Dark exposure across staged conversations; dashboards tracking turn‑wise safety deltas and coherence.
- Assumptions/dependencies: Access to open‑weights models or to models that permit activation hooks; organizational approval to run controlled harmful simulations; governance on storing harmful content.
- Targeted defensive prompt generation and optimization
- Sectors: software, healthcare (mental health apps), education platforms
- Use case: Automatically evolve system prompts that counter specific harmful patterns revealed by Dark models.
- Tools/products/workflows: “Defensive Prompt Evolution” module using meta‑prompting + genetic algorithms; A/B testing of evolved prompts vs. generic safety prompts; prompt libraries by scenario.
- Assumptions/dependencies: LLM‑as‑judge reliability; content policy oversight; periodic human review to avoid over‑refusal or empathy loss.
- Release‑gating and model selection
- Sectors: AI platform teams, MLOps
- Use case: Gate new assistant releases on multi‑turn crisis performance and coherence trade‑offs; choose safer base models or configurations.
- Tools/products/workflows: Pareto frontier reports (trait expression vs. coherence); turn‑wise safety thresholds as go/no‑go criteria; automated scorecards.
- Assumptions/dependencies: Agreement on safety thresholds; repeatable test sets or synthetic variants; compute budget for multi‑turn evaluation.
- Audit and compliance for mental‑health chatbots
- Sectors: healthcare, public sector, NGOs
- Use case: Pre‑deployment audits showing responses to crisis taxonomies; documented mitigation via evolved prompts.
- Tools/products/workflows: Audit dossiers with LLMs‑Mental‑Health‑Crisis and MentalBench scores across turns; incident replay and remediation evidence.
- Assumptions/dependencies: Clinical oversight of evaluation rubrics; jurisdiction‑specific compliance norms; privacy‑preserving logging.
- Safety policy and rubric refinement
- Sectors: policy, trust & safety teams
- Use case: Use Dark responses to refine taxonomies of harmful traits and update safety rubrics/templates.
- Tools/products/workflows: Iterative rubric updates tied to observed harmful clusters; playbooks for escalation to human care.
- Assumptions/dependencies: Cross‑functional review (legal, clinical, cultural advisors); controlled access to dark outputs.
- Training data augmentation for safeguards
- Sectors: ML/AI training, healthcare AI
- Use case: Generate high‑quality “hard negative” examples of harmful multi‑turn patterns for fine‑tuning or preference optimization toward safer behavior.
- Tools/products/workflows: Curated datasets of adversarial multi‑turn sequences; de‑identification and labeling pipelines; alignment training with safe counter‑responses.
- Assumptions/dependencies: Strong data governance; mitigation of bias amplification; human QA for labeling.
- Human agent training and simulation
- Sectors: healthcare hotlines, community support, customer ops
- Use case: Use synthetic worst‑case transcripts to train human responders on recognizing and correcting escalating harms.
- Tools/products/workflows: Scenario libraries; role‑play simulators with scoring; CME‑style training modules for counselors.
- Assumptions/dependencies: Psychological safety for trainees; clinician oversight; ethical use approvals.
- Incident forensics and mitigation testing
- Sectors: platform safety, risk management
- Use case: Reconstruct harm pathways from real incidents and test prospective mitigations (e.g., prompts, classifiers) in similar multi‑turn conditions.
- Tools/products/workflows: “What‑if” simulation lab; replay harness seeded with incident context; mitigation bake‑offs with statistical significance tests.
- Assumptions/dependencies: Secure handling of sensitive incident data; controlled distribution of dark stimuli.
- Cross‑cultural robustness spot‑checks (open models)
- Sectors: global product teams, localization
- Use case: Instantiate trait sets reflecting local norms to probe region‑specific failure modes in open‑weights, regionally‑trained models.
- Tools/products/workflows: Trait library extensions; locale‑specific contrastive prompts; per‑locale safety dashboards.
- Assumptions/dependencies: Cultural expertise; multilingual evaluation; careful generalization beyond mental‑health domain.
- Academic research instrumentation
- Sectors: academia, foundations
- Use case: Study emergent harms, representation geometry, and multi‑turn drift using the open code and datasets.
- Tools/products/workflows: Activation‑space analyses (SVD/UMAP), multi‑trait steering ablations, reproducible benchmarks.
- Assumptions/dependencies: Open‑weights availability; compute for batch evaluations; IRB/ethics review for human‑relevant work.
Long‑Term Applications
These require further research, ecosystem support, or access changes (e.g., to closed models or deployment infrastructure).
- Preventative steering in training loops
- Sectors: model providers, ML research
- Use case: Use discovered harmful subspaces to regularize or adversarially train models away from them (representation‑level constraints).
- Potential products: “Subspace‑aware” RLHF/SFT pipelines; training‑time penalties on harmful subspace activation.
- Assumptions/dependencies: Access to training runs; stability of identified subspaces across training phases; avoidance of over‑regularization.
- Runtime activation monitoring for harm drift
- Sectors: AI platforms, safety engineering
- Use case: Online detection of trajectory into harmful subspaces during long conversations, with automatic deflection or human handoff.
- Potential products: Inference middleware with activation probes and alarm policies; “crisis‑aware mode” toggles.
- Assumptions/dependencies: Access to intermediate activations in production; low‑latency projection/thresholding; privacy safeguards.
- Closed‑model steerability APIs
- Sectors: foundation model providers
- Use case: Expose safe, sandboxed APIs to simulate targeted multi‑turn harms for red‑teaming without direct activation access.
- Potential products: Provider‑hosted “harm simulation” endpoints; standardized multi‑turn safety harnesses.
- Assumptions/dependencies: Provider willingness and governance; audit logs; strict AUPs to prevent misuse.
- Standards and certification for crisis‑sensitive AI
- Sectors: regulators, standards bodies, healthcare
- Use case: Incorporate multi‑turn harm‑escalation metrics into certification regimes for AI therapy/companion apps.
- Potential products: Benchmark suites and pass/fail thresholds; ISO‑like guidance for multi‑turn safety evaluation.
- Assumptions/dependencies: Multi‑stakeholder consensus; clinical validation linking scores to outcomes.
- Domain expansion beyond mental health
- Sectors: finance, legal, medical advice, extremist content moderation
- Use case: Build trait taxonomies and steering vectors for harmful patterns in advice‑giving domains (e.g., risky financial encouragement, negligent medical suggestions, radicalization grooming).
- Potential products: Domain‑specific Dark evaluators; tailored defensive prompts and classifiers.
- Assumptions/dependencies: Expert‑curated taxonomies; liability review; domain‑appropriate evaluation rubrics.
- Persona‑ and culture‑adaptive guardrails
- Sectors: global consumer apps, education, social platforms
- Use case: Personalize defenses based on user persona signals to prevent misinterpretations that drive harm escalation.
- Potential products: Persona‑conditioned safety prompts; adaptive thresholds for escalation.
- Assumptions/dependencies: Privacy‑preserving persona inference; bias audits; user consent and transparency.
- Multimodal harm steering and defense
- Sectors: voice assistants, social/UGC platforms, robotics
- Use case: Extend subspace steering to audio/vision encoders to study and mitigate harms in multimodal conversations and embodied agents.
- Potential products: Multimodal “Dark” simulators; cross‑modal defensive prompt and policy generators.
- Assumptions/dependencies: Stable multimodal activation access; synchronized evaluation frameworks; robust coherence metrics across modalities.
- Insurance and risk underwriting for AI deployments
- Sectors: insurance, enterprise risk
- Use case: Quantify multi‑turn harm risk as part of underwriting and cybersecurity‑style premiums.
- Potential products: Risk scores combining ASR, multi‑turn harm deltas, and coherence; remediation discounts tied to defenses.
- Assumptions/dependencies: Industry‑accepted metrics; third‑party audits; link between metrics and incident rates.
- Early‑warning classifiers trained on Dark transcripts
- Sectors: safety engineering, healthcare platforms
- Use case: Train detectors to recognize early signals of harmful trajectory and trigger safeguards before escalation.
- Potential products: Turn‑level risk scorers; conversation “stoplights” integrated with escalation workflows.
- Assumptions/dependencies: Generalization beyond training scenarios; threshold tuning to avoid alert fatigue.
- Subspace‑aware architectural/hardware support
- Sectors: inference platforms, chip vendors
- Use case: Efficient, low‑latency activation projection and vector injection for real‑time steering/monitoring at scale.
- Potential products: Middleware libraries or accelerator kernels for projection/injection; per‑layer hooks in inference servers.
- Assumptions/dependencies: Stability of low‑rank subspace choices; performance budgets in production.
- Model comparison and public transparency
- Sectors: evaluation platforms, civil society
- Use case: Public scorecards comparing multi‑turn harm escalation across models and versions.
- Potential products: Open leaderboards with standardized multi‑turn crisis benchmarks; version diff reports.
- Assumptions/dependencies: Community‑maintained datasets; guardrails on publishing harmful outputs; consensus governance.
- Clinical integration for digital therapeutics
- Sectors: healthcare, digital health startups
- Use case: Pre‑certification testing of AI‑augmented mental‑health tools; continuous monitoring in post‑market surveillance.
- Potential products: Validation kits aligned with regulatory submissions; clinician‑interpretable safety summaries.
- Assumptions/dependencies: Clinical trials linking bench scores to patient outcomes; integration with care escalation pathways.
Notes on feasibility across applications
- Open‑weights dependency: MultiTraitsss and activation steering require access to model activations; immediate deployment is therefore most feasible with open models. Defensive prompts and evaluation protocols can still transfer to closed models.
- Evaluation validity: The paper relies on LLM‑as‑judge scoring and crisis datasets; high‑stakes deployment should add human evaluation and, where applicable, clinical oversight.
- Domain adaptation: Extending beyond mental health needs domain‑specific harmful trait taxonomies, tailored contrastive prompts, and appropriate rubrics.
- Ethical safeguards: Controlled access to dark models and content; strong AUPs; privacy‑preserving logs; consideration of cultural and regional differences to avoid unintended harms.
Glossary
- Activation steering: A representation-level method that manipulates a model’s internal activations to shift outputs toward desired behaviors without retraining. "Activation steering is a Representation Engineering (RepE) technique that manipulates LLM responses by directly steering the model's internal activations."
- AdvBench: A benchmark of adversarial prompts used to test jailbreak vulnerability and safety alignment. "To assess model safety against adversarial jailbreak attacks, we evaluate on the AdvBench harmful behaviors benchmark~\cite{zou_universal_2023}, which comprises 520 adversarial prompts."
- Affective Resonance Score (ARS): A MentalBench metric evaluating how well responses show emotional attunement and empathy. "Cognitive Support Score (CSS) and Affective Resonance Score (ARS)"
- Attack Success Rate (ASR): The percentage of adversarial prompts that successfully elicit unsafe outputs. "We report the Attack Success Rate (ASR), defined as the percentage of prompts for which the model produces unsafe content."
- Bonferroni correction: A multiple-comparisons adjustment applied to p-values to control family-wise error rate. "We perform paired one-tailed -tests with post-hoc Bonferroni correction to assess whether the mean score is significantly smaller than zero."
- Cognitive Support Score (CSS): A MentalBench metric assessing the informational and reasoning support provided by responses. "Cognitive Support Score (CSS) and Affective Resonance Score (ARS)"
- Coherence: A judged measure of consistency and logical flow in responses over turns. "Coherence is scored on a 0--100 scale, where 0 indicates completely incoherent or invalid text, 50 indicates somewhat inconsistent messaging, and 100 indicates fully consistent tone, messaging, and conversational flow"
- Contrastive prompt pairs: Prompt pairs designed to elicit high vs. low expression of a target trait to extract steering signals. "we constructed contrastive prompt pairs designed to elicit activation patterns corresponding to high versus low expression of each trait."
- Genetic algorithm: An evolutionary optimization method used here to iteratively improve protective prompts. "We then optimize prompts using a genetic algorithm over generations \cite{zhang_enhancing_2025}"
- Jailbreaking: Techniques that bypass model safety guardrails to induce prohibited or unsafe outputs. "adversarial attacks (e.g. jailbreaking prompts, malicious fine-tuning, etc)"
- Latent subspaces: Low-dimensional structures within activation space that encode behavioral features and can be targeted by perturbations. "Behavioral features in LLMs were shown to reside in low-dimensional latent subspaces, and that targeted perturbation vectors derived from these subspaces can induce meaningful state transitions in LLMs"
- LLM-as-judge: An evaluation approach where an LLM grades or scores other model outputs against a rubric. "Using LLM-as-judge approach, we performed evaluations for single- and multi-turn responses using 2 benchmarks with their respective scoring rubrics."
- Manifold basis: A low-dimensional basis capturing principal activation directions used to constrain steering. "For each target layer , we compute the manifold basis using Singular Value Decomposition (SVD)."
- Meta-prompt: A higher-level prompt that instructs the generation of targeted prompts (e.g., protective system prompts). "we first constructed a meta-prompt containing explicit instructions to generate protective system prompts"
- Multi-Trait Subspace Steering (MultiTraitsss): The proposed framework that projects multiple trait vectors into an activation subspace to steer behaviors while preserving coherence. "we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and novel subspace steering framework to generate Dark models that exhibits cumulative harmful behavioral patterns."
- Multi-trait steering: Simultaneous application of multiple steering vectors across layers, increasing risk of interference if unconstrained. "Furthermore, multi-trait steering involving the addition of multiple steering vectors at different layers can compound this destabilization."
- Parasocial relationships: One-sided emotional bonds formed by users with AI systems that can influence interaction dynamics. "Users may develop parasocial relationships with AI systems"
- Pareto frontier analysis: A method to select configurations that optimally trade off trait expression and coherence without being dominated. "Using Pareto frontier analysis, we identify configurations that represent optimal trade-offs, where no other configuration achieves both higher trait expression and lower coherence loss simultaneously."
- Permutation test: A nonparametric significance test using label shuffling to assess cluster separation or other statistics. "Cluster separation was quantified using silhouette score, with statistical significance assessed via permutation test (n=1000)."
- Representation Engineering (RepE): A class of techniques that modify or exploit internal representations to control model behavior. "Activation steering is a Representation Engineering (RepE) technique that manipulates LLM responses by directly steering the model's internal activations."
- Residual-stream activations: The intermediate activation pathway in transformer models from which steering vectors are extracted. "enabling the extraction of steering vectors from residual-stream activations."
- RLHF: Reinforcement Learning from Human Feedback, a training approach for aligning model behavior with human preferences. "far less compute than fine-tuning or RLHF"
- Silhouette score: A metric that quantifies how well data points are clustered, used here to assess response clustering. "Cluster separation was quantified using silhouette score, with statistical significance assessed via permutation test (n=1000)."
- Singular Value Decomposition (SVD): A matrix factorization used to derive a low-rank basis for the activation manifold. "For each target layer , we compute the manifold basis using Singular Value Decomposition (SVD)."
- Steering vector: A vector applied to activations to push model behavior toward or away from a trait. "Steering vectors were then extracted by computing the difference in mean activation across all response tokens between positive and negative samples"
- Subspace constrained multi-trait steering: Projecting steering vectors onto a learned low-rank activation subspace to reduce incoherence when steering multiple traits. "we propose a subspace constrained multi-trait steering method that projects steering vectors onto a low-rank subspace"
- System prompt: A control prompt that sets global instructions and behavior for the assistant. "While prompt engineering via system prompt may be commonly used as the first line of defense to ensure safety, it can be unreliable and less effective in controlling model's response"
- UMAP: A non-linear dimensionality reduction technique for visualizing high-dimensional embeddings. "UMAP (n neighbors=15, min dist=0.1, cosine metric) projected embeddings to 2D for visualization and analysis"
- Vector projection: Orthogonally projecting a vector onto a subspace to retain within-manifold components. "Each steering vector is projected onto the learned subspace."
Collections
Sign up for free to add this paper to one or more collections.