Decomposing Theory of Mind: How Emotional Processing Mediates ToM Abilities in LLMs (2511.15895v1)

Published 19 Nov 2025 in cs.AI

Abstract: Recent work shows activation steering substantially improves LLMs' Theory of Mind (ToM) (Bortoletto et al. 2024), yet it remains unclear what changes internally to produce the different outputs. We propose decomposing ToM in LLMs by comparing steered versus baseline activations using linear probes trained on 45 cognitive actions. We applied Contrastive Activation Addition (CAA) steering to Gemma-3-4B and evaluated it on 1,000 BigToM forward belief scenarios (Gandhi et al. 2023). We find that improved performance on belief attribution tasks (32.5% to 46.7% accuracy) is mediated by activations processing emotional content, namely emotion perception (+2.23) and emotion valuing (+2.20), while analytical processes are suppressed: questioning (-0.78) and convergent thinking (-1.59). This suggests that successful ToM abilities in LLMs are mediated by emotional understanding, not analytical reasoning.

Summary

  • The paper reveals that activation steering boosts ToM accuracy by 14.2 percentage points, from 32.5% to 46.7%, with the gain mediated by emotional processing.
  • The paper applies a mechanistic interpretability approach using 45 linear probes to identify critical activation shifts in mid-model layers.
  • The paper demonstrates that the accuracy gains coincide with enhanced emotional and generative faculties and suppressed analytical reasoning in the LLM.

Decomposing ToM in LLMs: Emotional Processing as a Mediator

Introduction

This paper presents a mechanistic interpretability approach to dissect Theory of Mind (ToM) abilities in LLMs, specifically focusing on the role of emotional processing versus analytical reasoning. The authors apply Contrastive Activation Addition (CAA) steering to the Gemma-3-4B model and utilize a suite of 45 linear cognitive action probes. Evaluation on 1,000 BigToM belief attribution scenarios demonstrates that steering not only improves ToM task performance (an increase in accuracy from 32.5% to 46.7%) but also induces pronounced changes in the model’s internal activation structure. Notably, these changes reflect a systematic enhancement of emotional and generative cognitive processes and a concurrent suppression of analytical processes during successful perspective-taking.

Methods

The methodology integrates activation steering, probe-based interpretability, and controlled evaluation on ToM tasks. The paper defines a taxonomy of 45 cognitive actions, spanning Metacognitive, Analytical, Creative, Emotional, and Memory-related processes, constructed from cognitive science frameworks and literature.

Synthetic narratives exemplifying each action serve as training data for one-vs-rest linear probes, which are trained to decode action presence from activations captured across all 30 layers of Gemma-3-4B. The CAA steering vectors are trained using contrastive triplets from BigToM’s forward belief scenarios, ensuring that the vectors target representational differences underlying correct versus incorrect mental state attributions.
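To make the probe setup concrete, here is a minimal sketch of one-vs-rest linear probing on final-token activations. The shapes and random data are stand-ins; the paper trains its probes with AdamW, cosine annealing, and early stopping on AUC-ROC rather than this simplified scikit-learn setup.

```python
# Minimal sketch of one-vs-rest linear probing on layer activations.
# Shapes and the random data are stand-ins, not the paper's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples, n_actions = 256, 900, 45

# Stand-in for final-token activations from one layer of the model.
X = rng.normal(size=(n_examples, d_model))
# Which of the 45 cognitive actions each synthetic narrative exemplifies.
labels = rng.integers(0, n_actions, size=n_examples)

aucs = []
for action in range(n_actions):
    y = (labels == action).astype(int)  # one-vs-rest binary target
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))

print(f"mean AUC-ROC across {n_actions} probes: {np.mean(aucs):.3f}")
```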

For evaluation, the model's answers to belief attribution questions are scored via logit-based ranking, and probe activations are measured at three critical timepoints: upon question presentation, after the correct answer, and after the incorrect answer. Statistical analysis focuses on the activation difference (steered minus baseline) for each probe and category, thereby elucidating which cognitive processes become more or less prominent under improved ToM performance.
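The logit-based ranking step can be sketched as follows. The prompt, option format, and checkpoint id are assumptions; the idea is simply to compare the model's logits for the correct versus incorrect option at the final token position instead of sampling free-form text.

```python
# Minimal sketch of probability-based answer ranking (prompt and
# checkpoint id are illustrative; any causal LM checkpoint works).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-4b-it"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "<story>\nQuestion: What does the protagonist believe?\nAnswer:"
options = [" A", " B"]  # one correct, one incorrect belief option

with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]

# Score each option by the logit of its first token at the final position.
scores = {
    opt: logits[tok(opt, add_special_tokens=False).input_ids[0]].item()
    for opt in options
}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```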

Probe Performance and Layer Diagnostics

Validation of the linear probes indicates robust classification, with an average AUC-ROC of 0.78 across all cognitive actions and layers. Layer-wise inspection reveals a pronounced peak in probe discriminability within mid-layers (layers 5-24), with optimal cognitive action abstraction at layer 9 (AUC-ROC 0.948). Early layers are biased toward superficial linguistic features, while the final layers are dominated by next-token prediction dynamics, diminishing probe fidelity (Figure 1).

Figure 1: Cognitive action probe performance across all 30 layers, peaking at layer 9, with mid-layers encoding the highest-level cognitive abstractions.

Furthermore, action-wise ranking exposes considerable heterogeneity. Actions such as suspending_judgment (AUC-ROC 0.988) and counterfactual_reasoning (0.984) manifest highly localized, easily detectable activation patterns. In contrast, emotion_responding (0.778) and understanding (0.837) appear more contextually or representationally distributed (Figure 2).

Figure 2: Top and bottom cognitive actions by probe discriminability, indicating which mental faculties are most and least clearly represented.

Cognitive Shifts Induced by Steering

Application of CAA vectors produces a marked increase in ToM task accuracy, a 14.2 percentage point improvement driven by a shift in the model's internal processing. Detailed probe analysis shows that this shift is characterized by increased activation of emotional and creative processes, specifically emotion_perception (Δ=+1.73), emotion_valuing (Δ=+0.85), emotion_understanding (Δ=+0.77), and hypothesis_generation (Δ=+1.63), across all timepoints. Simultaneously, analytical processes decline: questioning (Δ=-1.24), convergent_thinking (Δ=-1.13), and understanding (Δ=-0.77).
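As a worked illustration, each reported delta is simply the mean probe readout under steering minus the mean under baseline, computed per cognitive action (and per timepoint). The arrays below are random stand-ins, not the paper's data.

```python
# Worked illustration of the reported deltas: mean probe readout under
# steering minus mean under baseline, per cognitive action.
import numpy as np

rng = np.random.default_rng(1)
actions = ["emotion_perception", "hypothesis_generation",
           "questioning", "convergent_thinking"]

baseline = {a: rng.normal(size=1000) for a in actions}  # 1,000 scenarios, no steering
steered = {a: rng.normal(size=1000) for a in actions}   # same scenarios, CAA applied

for a in actions:
    delta = steered[a].mean() - baseline[a].mean()
    print(f"{a}: Delta = {delta:+.2f}")
```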

The radar comparison and longitudinal probe assessment (Figure 3) illustrate the magnitude and consistency of this divergence.

Figure 3: Cognitive action category comparison demonstrating the amplification of emotional and creative processes in steered models relative to baseline.

Complete timecourse analyses confirm these findings: emotional actions maintain elevations at question, post-correct answer, and post-incorrect answer stages, while analytical actions decrease across all phases (Figure 4).

Figure 4: Time-resolved steering effects indicating systematic modulation of action categories, with persistent emotional activation increases and analytical decreases.

Category-level aggregation further corroborates that ToM improvement in LLMs is not an effect of increased overall cognition, but reflects a trade-off: enhanced emotional and creative faculties at the expense of classical analytical faculties (Figure 5).

Figure 5: Category-wise quantification and distribution of steering effects at the answer timepoint, reinforcing the observed polarity shift.

Analysis of the most and least affected cognitive actions pinpoints the processes most radically altered by ToM steering (Figure 6).

Figure 6: Top cognitive actions by activation change between baseline and steering, highlighting emotional and generative processes as primary drivers of successful ToM.

A heatmap of probe differences across all actions and timepoints demonstrates the global and consistent nature of these shifts (Figure 7).

Figure 7: Heatmap of cognitive action differences induced by steering, confirming categorical patterns.

Theoretical and Practical Implications

The results directly contradict the hypothesis that LLM-based ToM is predominantly driven by analytical or chain-of-thought instantiation. Instead, the findings substantiate a model wherein emotional processing forms the mechanistic substrate for belief attribution and successful perspective-taking in LLMs. This aligns with select neuroscience results suggesting that cognitive and affective ToM are tightly coupled in human neural architectures, especially in temporoparietal regions—though no direct correspondence is claimed here.

Practically, the mechanistic decomposition approach grants fine-grained interpretability, which may inform the design of safer, more reliable AI with explicit cognitive scaffolding. It further suggests that future advances in social reasoning may depend more on representing emotional context and generative faculties than on optimizing deliberative, analytical routines.

The protocol offers a scalable template for cognitive decomposition: applying interpretability tools (probes, steering) to dissect and modulate internal faculties relevant for any complex AI capability.

Expanding this framework to larger, more diverse model architectures and broader benchmarks will be crucial for future mechanistic transparency. It also opens questions about the transferability of steering effects and whether analogous patterns are a universal property of neural architectures trained on linguistic data.

Conclusion

This paper provides a rigorous, probe-driven decomposition of the internal mechanisms through which activation steering improves ToM in LLMs. The analyses reveal that enhanced performance is mediated by increased engagement of emotional and generative cognitive processes and suppression of analytical reasoning, suggesting that the essence of successful perspective-taking in LLMs is fundamentally affective rather than deliberative. These insights challenge prevailing interpretive frameworks and pave the way for more nuanced investigations of high-level cognition in artificial agents.

Explain it Like I'm 14

What is this paper about?

This paper looks at how to help an AI better understand what other people think and believe—something called “Theory of Mind” (ToM). The authors don’t just measure whether the AI gets ToM questions right; they try to peek inside the AI to see which kinds of thinking become stronger or weaker when the AI does better. Their big takeaway: when the AI improves at ToM, it seems to rely more on emotional understanding and less on pure step-by-step analytical reasoning.

What questions did the researchers ask?

The paper focuses on three simple questions:

  • If we use a known technique to boost an AI’s ToM, what changes inside the AI’s “thinking”?
  • Do some types of mental processes—like noticing emotions—become more active when the AI gets ToM questions right?
  • Are more logical, puzzle-solving processes actually less active when the AI succeeds at perspective-taking?

How did they do the research?

Think of an LLM like Gemma-3-4B as a huge text-predicting machine with many layers that process words step by step, like pages in a notebook where each page adds more detail.

The team used two main tools:

  1. “Activation steering” (Contrastive Activation Addition, or CAA)
  • Analogy: Imagine turning a knob inside the AI to push it toward answers that match correct perspective-taking and away from answers that match incorrect perspective-taking.
  • They trained this “steering knob” using pairs of examples where one answer shows good ToM (understanding what someone believes) and the other shows poor ToM; a minimal sketch of the idea appears after this list.
  2. “Probes” to read internal signals for 45 cognitive actions
  • Analogy: A probe is like a detector that lights up when a certain type of thinking is active—such as “questioning,” “hypothesis generation,” or “emotion perception.”
  • They created short, first-person stories that strongly show one specific mental action (like “I realized my friend was sad from their tone”). Then they trained simple detectors to recognize those actions in the AI’s internal activity.
  • These detectors were applied to the AI at different layers (like checking different pages in the notebook) and at different moments (at the question, after a correct answer, after a wrong answer).
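Here is a tiny code sketch of that steering knob, with made-up tensors standing in for real activations. (The paper actually builds its vectors from PCA-centered activation differences across layers 14-30, so treat this mean-difference version as the simplest possible picture.)

```python
# Toy sketch of the CAA "steering knob" (all tensors are made up).
import torch

def caa_vector(h_good: torch.Tensor, h_poor: torch.Tensor) -> torch.Tensor:
    """Average difference between good-ToM and poor-ToM activations."""
    return (h_good - h_poor).mean(dim=0)

d_model = 2560                       # hypothetical hidden size
h_good = torch.randn(752, d_model)   # activations on good-ToM answers
h_poor = torch.randn(752, d_model)   # activations on poor-ToM answers
knob = caa_vector(h_good, h_poor)

# At answer time, add the knob (scaled by a strength) to the hidden states.
alpha = 4.0
hidden = torch.randn(1, 10, d_model)  # toy hidden states for one prompt
steered = hidden + alpha * knob       # nudged toward good-ToM answers
```

The key point: a single direction in activation space, added with an adjustable strength, is enough to shift the model's behavior.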

Testing the AI

  • The model answered 1,000 “false-belief” questions (like classic ToM tests: a person believes something that isn’t true because they didn’t see what happened).
  • The questions were multiple-choice. Instead of letting the model write anything it wanted, the researchers looked at which choice the model was more likely to pick based on its internal probabilities.
  • They compared performance before and after using CAA and checked which “thinking detectors” lit up more or less.

What did they find, and why does it matter?

Here are the main results in plain terms:

  • The steering technique made the AI better at ToM. Accuracy rose from 32.5% to 46.7%, meaning the model answered about 142 more of the 1,000 questions correctly overall.
  • When the AI did better, its “emotional” thinking became more active. Detectors for things like emotion perception (spotting feelings) and emotion valuing (recognizing the importance of feelings) lit up more.
  • At the same time, some classic “analytical” thinking decreased. Detectors for questioning, convergent thinking (picking one best answer), and general “understanding” were less active.
  • Creative/generative thinking (like hypothesis generation—coming up with possible explanations) also increased.

Why this matters:

  • Many people assume that better social reasoning in AI comes from more careful step-by-step logic. But here, success was linked to stronger emotional and generative processing and weaker analytical checking.
  • This suggests that for perspective-taking, tuning the AI to “feel” the situation—recognize emotions and imagine explanations—may be more helpful than pushing it to do strict logical analysis.

What does this mean for the future?

  • Building better social AI: If we want AI that understands people’s beliefs and feelings, we may need to strengthen its emotional and generative representations—not just its logical reasoning.
  • Understanding AI “thinking”: Using probes plus steering gives a way to break down complex skills (like ToM) into parts and see which parts actually drive success.
  • Limits: The authors don’t claim this is how human brains work. They just show what changes inside one AI model when it gets better at ToM. Future work should test more models and more types of tasks.
  • Big picture: Perspective-taking in AI might be more about tuning the model to recognize emotional context and imagine possible beliefs than making it do longer chains of logic. This could guide how we train and evaluate social reasoning in AI going forward.

Knowledge Gaps

Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, framed to be concrete and actionable for future research.

  • Probe training data is entirely synthetic and generated by the same model (Gemma-3-4B), risking circularity and model-specific biases; validate probes with human-annotated corpora and out-of-model data sources.
  • Linear probes achieve modest average performance (AUC-ROC ≈ 0.78); assess robustness with stronger methods (e.g., non-linear probes, causal tracing, sparse autoencoders) and include sanity checks (random labels, control features).
  • The fixed suffix and final-token extraction strategy may alter representations; test whether findings persist with native inputs, multi-token probing, and alternative extraction positions.
  • “Layer count” as the primary metric for action presence is underdefined (thresholding criteria not specified) and hard to interpret; compare against calibrated continuous scores, effect sizes, and attribution-based measures.
  • No statistical significance or uncertainty quantification for reported differences; report confidence intervals, p-values, and variance across random seeds and probe initializations.
  • Steering vectors trained on only 752 triplets may be underpowered and dataset-specific; expand training sets, ensure strict disjointness from evaluation items, and test for overfitting/leakage.
  • Generalization beyond a single model (Gemma-3-4B) is untested; replicate across architectures (decoder-only vs. encoder-decoder), sizes, and training regimes (pretrain-only vs. RLHF).
  • Evaluation scope is narrow (forward belief “false” condition, binary letter selection); extend to full BigToM suite (true beliefs, second-order beliefs, deception), other ToM benchmarks, and free-form generation settings.
  • The central conclusion that emotional processing mediates ToM is correlational; perform causal interventions (ablate/amplify emotion-related features) and mediation analyses to test necessity and sufficiency.
  • Probes may pick up emotional lexicon rather than latent emotional representations; include lexical controls, counterbalanced stimuli, and adversarial examples to disentangle semantics from surface cues.
  • Only CAA steering is examined; compare against alternative interventions (prompting, supervised fine-tuning, other representation engineering methods) to test whether the same cognitive shifts occur.
  • Off-target impacts of steering on other capabilities are unmeasured; quantify trade-offs on analytical reasoning, calibration, mathematical tasks, and safety-relevant behaviors.
  • Sensitivity to steering hyperparameters (vector magnitude, targeted layers, injection points) is not explored; perform systematic sweeps to map the parameter-performance surface.
  • Instance-level links between probe outputs and correctness are not analyzed; compute per-example mediation, partial correlations, and causal path models to determine which actions predict success.
  • Results for the “perspective_taking” probe (included in the taxonomy) are not reported; analyze whether changes in this probe directly track accuracy and steering effects.
  • Dataset artifacts and shortcut learning risks are unaddressed; profile the 217 improved items to see whether gains cluster in emotionally salient stories or exploit spurious cues.
  • Memory and metacognitive dimensions are underrepresented (single “remembering” action; limited reporting on metacognition); expand taxonomy (e.g., working memory, attentional control, planning) and evaluate their roles.
  • Temporal dynamics are coarse (three timepoints only); analyze token-level trajectories and attention patterns to capture richer sequencing of cognitive actions during inference.
  • Cross-lingual generalization remains unknown; evaluate on non-English ToM datasets to test whether observed patterns depend on English training distributions.
  • Reproducibility is limited (artifacts omitted for review); release code, probe weights, steering vectors, prompts, and hyperparameters to enable independent verification.
  • Calibration of probability-based answer selection is not examined; test robustness to temperature, logit scaling, option formatting, and positional biases beyond randomization.
  • No human baselines or inter-rater comparisons are provided; situate model gains relative to human performance on the same items to contextualize effect magnitude.
  • Construct validity of the cognitive-action taxonomy is assumed; conduct convergent/discriminant validity tests and factor analyses to confirm that probes track intended constructs in LLMs.
  • Peak probe performance is reported at layer 9, yet analysis focuses on layers 10–20; justify layer selection and test whether conclusions depend on layer window choice.
  • Necessity vs. correlation of increased emotional/generative signals is unresolved; design counterfactual steering that increases analytical processes and compare ToM outcomes.
  • Safety and ethical implications of amplifying “emotional” processing in LLMs are not discussed; assess risks of anthropomorphism, persuasion/manipulation, and miscalibrated socio-emotional behavior.

Practical Applications

Immediate Applications

Below are actionable, near-term uses that can be built with existing open-weight models and current activation-steering toolchains.

  • Empathy-mode customer support and sales assistants
    • Sectors: software, customer service, retail, finance (collections), telecom
    • Tools/workflows: add a Contrastive Activation Addition (CAA) “empathy vector” at inference to increase emotion_perception/valuing during sensitive conversations; expose an “Empathy slider” in the chat UI; monitor cognitive action probes in layers ~10–20 to verify the agent is in the intended regime
    • Assumptions/dependencies: access to model activations (open weights or API that allows activation additions), acceptance that reduced analytical interrogation may trade off with strict task accuracy; careful prompt routing so “empathy mode” is used only where appropriate
  • Triage and de-escalation bots for grievance handling
    • Sectors: government services, healthcare intake, insurance claims, HR
    • Tools/workflows: deploy a steered small LLM for first-response triage that prioritizes emotional understanding before escalating to a more analytical agent; use probes as real-time guards to ensure emotional processes are active during intake
    • Assumptions/dependencies: ethical policies for vulnerable populations; human-in-the-loop escalation; steer strength must be calibrated to avoid suppressing factual checks
  • Emotion-aware educational tutors and feedback generators
    • Sectors: education, edtech
    • Tools/workflows: enable steering when providing formative feedback to students; probes verify presence of hypothesis_generation and emotion_understanding; route to analytical mode for grading or proof checking
    • Assumptions/dependencies: task routing between “emotional” and “analytical” modes; content safety filters; student privacy
  • Writing assistants that adapt tone and perspective
    • Sectors: productivity software, enterprise collaboration
    • Tools/workflows: “Consider-the-reader” rewrite feature that boosts perspective_taking and emotion_perception; cognitive probes supply a dashboard view to authors showing which cognitive actions dominated the draft
    • Assumptions/dependencies: user consent for tone-shaping; integration with office suites; quality guardrails to prevent loss of factual precision
  • Model evaluation dashboards for social reasoning
    • Sectors: AI research, MLOps, model governance
    • Tools/workflows: integrate the paper’s 45 cognitive-action probes into observability to track shifts induced by prompts, system messages, or guardrails; correlate probe readouts with ToM benchmark performance (e.g., BigToM)
    • Assumptions/dependencies: probe generalization beyond Gemma-3-4B; internal activation access; acceptance that probes are correlational (AUC-ROC ~0.78)
  • Safer conversational UX via context-aware steering
    • Sectors: consumer apps, healthcare information, mental health support (non-diagnostic)
    • Tools/workflows: automatically enable empathy steering for high-affect contexts detected via probes; disable or reduce in tasks requiring exact calculation or legal advice
    • Assumptions/dependencies: clear user disclosures; not a substitute for professional care; policy-based routing; adherence to safety guidelines
  • Contact-center QA and coaching
    • Sectors: BPOs, enterprise operations
    • Tools/workflows: analyze call transcripts with probes to quantify presence of emotional vs analytical actions; coach agents on when to shift modes; simulate “better” responses by applying CAA to the same transcript
    • Assumptions/dependencies: transcription quality; privacy and consent; domain adaptation of probes to spoken language
  • Role-play and scenario simulators for soft-skills training
    • Sectors: L&D, HR, sales enablement, healthcare bedside manner training
    • Tools/workflows: steer bots toward affective ToM in role-play; use probe readouts to score and explain why a response exhibited good perspective-taking (emotion_perception up, questioning down)
    • Assumptions/dependencies: content review for bias; multi-turn stability of steering; evaluation beyond synthetic benchmarks
  • Product research and UX testing with empathic agents
    • Sectors: consumer research, design
    • Tools/workflows: run steered agents to explore customer perspectives; compare outputs with and without steering to surface emotionally salient pain points
    • Assumptions/dependencies: representativeness of model’s learned affective priors; triangulation with real user data
  • Governance checklists and procurement addenda
    • Sectors: public-sector IT, enterprise AI procurement
    • Tools/workflows: require vendors to report ToM-relevant probe metrics and steering controls; verify that empathy modes can be toggled, logged, and audited
    • Assumptions/dependencies: buyers able to demand activation-level transparency; standardized reporting templates

Long-Term Applications

These opportunities likely require further validation, scaling, cross-model replication, or new APIs that expose internal states safely.

  • Context-adaptive “cognitive control layer” for agents
    • Sectors: software platforms, multi-agent systems
    • Tools/products: middleware that dynamically steers between affective and analytical regimes per task segment; policy engines that bind business rules to cognitive-action targets
    • Dependencies: robust, model-agnostic steering APIs; stability across long contexts; meta-controllers that prevent mode oscillation
  • Socially adept robotics for care and companionship
    • Sectors: healthcare, eldercare, consumer robotics
    • Tools/products: onboard LLMs with affective steering for human-robot interaction; probe-based monitors to prevent overfitting to theatrics while maintaining safety
    • Dependencies: on-device inference and activation access; certification for safety-critical settings; longitudinal evaluation with humans
  • Clinical communication support (not diagnosis)
    • Sectors: healthcare
    • Tools/products: assistants that draft patient messages or visit summaries tuned for empathy; “bedside manner” simulators for clinician training
    • Dependencies: stringent compliance (HIPAA/GDPR), bias audits, domain-tuned probes; evidence that steering doesn’t degrade medical accuracy; human oversight
  • Negotiation and mediation copilots
    • Sectors: law, HR, enterprise negotiations, diplomacy training
    • Tools/products: multi-party assistants that track inferred beliefs and emotions; surface “belief maps” grounded in probe signals; apply CAA for tactful proposals
    • Dependencies: validated ToM beyond simple benchmarks; safeguards against manipulation; explainability requirements
  • Standardized audits and certifications for affective-ToM claims
    • Sectors: policy, standards bodies, insurers
    • Tools/products: certification regimes requiring decomposition metrics (probe profiles, steering effects, task trade-offs) before deployment in sensitive contexts
    • Dependencies: consensus test suites; independent labs; legal frameworks for disclosure
  • Training-time representation shaping for social reasoning
    • Sectors: AI model development
    • Tools/products: pretraining or finetuning with loss terms that encourage disentangled affective vs analytical circuits; joint training with probe feedback
    • Dependencies: scalable training-time interpretability; evidence that such shaping improves generalization without spurious correlations
  • Early warning systems for manipulative or deceptive behavior
    • Sectors: platform safety, trust & safety
    • Tools/products: probe-driven monitors that flag patterns of undue emotional leverage or suppressed analytical scrutiny; automatic throttle or human review triggers
    • Dependencies: validated mapping from probe signals to real-world harm; low false-positive rates; governance playbooks
  • Emotion-aware RAG and retrieval policies
    • Sectors: software, enterprise knowledge, customer success
    • Tools/products: retrieval strategies that adapt to user affect (e.g., prioritize supportive guidance before dense technical detail); steer analytical mode back on for final answers
    • Dependencies: orchestration frameworks that can read probe states; careful latency management; user testing at scale
  • Social science and cognitive modeling at scale
    • Sectors: academia
    • Tools/products: use steered LLM agents in controlled simulations to test hypotheses about affective vs cognitive ToM interactions; generate synthetic data with annotated “cognitive action” trajectories
    • Dependencies: cross-model replication; triangulation with human studies; norms for ethical use of synthetic social data
  • Personal agents that learn user-specific empathy profiles
    • Sectors: consumer AI, accessibility
    • Tools/products: agents that tune steering strengths to individual preferences and contexts (e.g., direct vs supportive communication styles)
    • Dependencies: privacy-preserving personalization; user controls and transparency; safeguards against behavioral manipulation

Cross-cutting assumptions and dependencies

  • Technical access: most applications require access to intermediate activations (currently feasible with open-weight models or specialized APIs) and the ability to apply CAA during inference.
  • Generalization: findings were demonstrated on Gemma-3-4B with BigToM forward-belief tasks; external validity to other models, languages, and richer social contexts must be established.
  • Trade-offs: steering increased emotional processes while suppressing analytical ones; tasks demanding precision, compliance, or formal reasoning must incorporate mode routing or post-hoc verification.
  • Measurement validity: linear probes (avg AUC-ROC ~0.78) provide correlational signals; they should be treated as indicators, not ground truth of “cognition.”
  • Safety and ethics: emotional steering can increase persuasive power; require disclosures, opt-in, and guardrails, especially in healthcare, finance, and youth-facing products.
  • Operational concerns: added latency and engineering complexity for hooking layers 10–20, calibration of steer strength, and continuous monitoring to manage distribution shift (a minimal hook sketch follows this list).
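As a rough illustration of what "hooking layers 10–20" involves, the sketch below injects a steering vector with PyTorch forward hooks. The module path (model.model.layers) and the helper names are assumptions modeled on typical Hugging Face decoder-only checkpoints, not a verified API for any particular product.

```python
# Hypothetical sketch: inject a CAA steering vector via forward hooks.
# The module path model.model.layers is an assumption based on common
# HF decoder-only checkpoints.
import torch

def make_steering_hook(vector: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element
        # is the hidden state; add the scaled steering vector to it.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def enable_empathy_mode(model, vector: torch.Tensor, alpha: float = 4.0,
                        layers=range(10, 21)):
    """Attach hooks to layers 10-20; keep the handles to disable later."""
    return [
        model.model.layers[i].register_forward_hook(
            make_steering_hook(vector, alpha))
        for i in layers
    ]

# handles = enable_empathy_mode(model, caa_vector)  # turn empathy mode on
# for h in handles: h.remove()                      # turn it back off
```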

Glossary

  • Activation patterns: Distributions of internal neuron activations across layers/tokens that reflect what the model is representing. "comparing steered versus baseline activation patterns could reveal which processes are essential for successful perspective-taking"
  • Activation steering: Modulating a model’s internal activations to influence its outputs toward desired behaviors. "Recent work shows activation steering substantially improves LLMs' Theory of Mind (ToM) (Bortoletto et al. 2024)"
  • AdamW: An optimizer with decoupled weight decay commonly used for training deep neural networks. "with AdamW optimization, cosine annealing, and early stopping based on AUC-ROC."
  • Answer position randomization: Randomizing the order of answer options to prevent positional biases in evaluation. "Answer position randomization ensures that the model cannot exploit systematic biases in option ordering."
  • Answer ranking by probability: Selecting an answer based on predicted probabilities rather than generating text. "Following the BigToM evaluation protocol, we used answer ranking by probability rather than free-form text generation."
  • AUC-ROC: Area under the Receiver Operating Characteristic curve; a metric for binary classifier performance. "binary probes achieved 0.78 average AUC-ROC and 0.68 F1 across 45 actions."
  • Belief attribution tasks: Tasks requiring inference of an agent’s beliefs from context. "we find improved performance on belief attribution tasks (32.5% to 46.7% accuracy)"
  • BigToM: A benchmark dataset for evaluating Theory of Mind in LLMs. "evaluated it on 1,000 BigToM forward belief scenarios (Gandhi et al. 2023)"
  • Contrastive Activation Addition (CAA): A steering method that adds contrastive activation differences to shift model representations toward desired outcomes. "activation steering techniques, particularly Contrastive Activation Addition (CAA), can substantially improve LLMs' ToM performance on belief attribution tasks"
  • Cosine annealing: A learning rate schedule that decays according to a cosine function. "with AdamW optimization, cosine annealing, and early stopping based on AUC-ROC."
  • Early stopping: Halting training when a validation metric stops improving to avoid overfitting. "with AdamW optimization, cosine annealing, and early stopping based on AUC-ROC."
  • Final token position: The last token index used to read off probabilities/logits for classification. "from model logits at the final token position"
  • Final-token extraction: Reading activations at the final token to evaluate or probe a representation. "for consistent final-token extraction."
  • Forward belief scenarios: ToM tasks where the correct belief tracks the forward (true) state of the world. "We evaluated 1,000 forward belief scenarios from BigToM (Gandhi et al. 2023) (forward_belief_false)"
  • Gemma-3-4B: A 4-billion-parameter Gemma LLM used in experiments. "We applied Contrastive Activation Addition (CAA) steering to Gemma-3-4B"
  • Layer count: The number of layers where a probe detects the presence of a target feature/action. "For each action, we computed layer count (layers 10-20 where probe confidence indicated presence)"
  • Linear probes: Simple linear classifiers trained on model activations to detect the presence of specific features. "We trained 45 binary linear probes using one-vs-rest classification"
  • Logits: Pre-softmax scores output by a model that can be converted to probabilities. "Answers were evaluated by computing p(correct) vs p(incorrect) from model logits."
  • Mechanistic interpretability: Methods for understanding a model’s internal mechanisms and representations. "We introduce a decomposition approach using techniques from mechanistic interpretability in LLMs"
  • Middle Temporal Gyrus (MTG): A brain region implicated in social cognition and language processing. "patterns in TPJ (Temporoparietal Junction) and MTG (Middle Temporal Gyrus) reflect the same neuronal activity, equally recruited in these two independent conditions."
  • nnsight: A tool/library for extracting and inspecting neural network activations. "Activations were extracted from layers 0-30 of Gemma-3-4B using nnsight"
  • One-vs-rest classification: A strategy where a separate binary classifier is trained for each class against all others. "We trained 45 binary linear probes using one-vs-rest classification"
  • PCA-centered activation differences: Activation differences processed using Principal Component Analysis to center/structure steering vectors. "Vectors were trained across layers 14-30 using PCA-centered activation differences."
  • Perspective-taking: Inferring or modeling another agent’s viewpoint or beliefs. "This raises a question: what cognitive processes change when models successfully engage in perspective-taking?"
  • Probability-based ranking approach: Selecting answers by comparing their predicted probabilities instead of generating text. "This probability-based ranking approach eliminates confounds from text generation artifacts and provides a more reliable measure of the model's belief attribution capabilities."
  • Steering vectors: Directional vectors added to internal activations to guide model behavior. "We investigate how steering vectors modulate cognitive processes during belief attribution tasks."
  • Temporoparietal Junction (TPJ): A brain region associated with Theory of Mind and social cognition. "patterns in TPJ (Temporoparietal Junction) and MTG (Middle Temporal Gyrus) reflect the same neuronal activity, equally recruited in these two independent conditions."
  • Theory of Mind (ToM): The ability to attribute mental states to oneself and others. "Recent work shows activation steering substantially improves LLMs' Theory of Mind (ToM) (Bortoletto et al. 2024)"