Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift

Published 14 May 2026 in cs.HC | (2605.15455v1)

Abstract: Chatbot behavior is often opaque to users, as responses can shift unpredictably across a conversation, drifting toward sycophancy, toxicity, or other unsafe responses. This can leave users vulnerable, either being misled by overly agreeable AI or manipulated by a harmful chatbot that no longer behaves as intended. To address this, we introduce multi-turn neural transparency, an interface that surfaces an LLM's internal neural activations in real time to help users anticipate and recognize how behaviors change across turns. We construct behavioral vectors for six personality traits using methods from mechanistic interpretability, identifying directions in activation space that correlate with trait expression ($R^2 \geq 0.9$) via contrastive system prompts, and visualize trait expression using a sunburst and drift panel that updates at each turn. In a randomized controlled study (N = 246), participants predicted trait expression from a system prompt alone, then rated observed behavior after interacting with the chatbot for both assistant and role-play personas. We find that participants without visualization struggled to accurately evaluate traits (RMSE $\approx$ 0.6-0.7), while the inclusion of neural transparency significantly improved both anticipation and evaluation compared to no visualization (d = -0.34 to -0.49). The multi-turn dynamic visualization additionally outperformed the static single-turn visualization on holistic evaluation of model behavior (d = -0.32). Transparency also reduced overconfidence: participants without visualization grew more confident despite no gain in accuracy. These findings suggest that surfacing internal model representations to everyday users is a meaningful step toward more transparent and informed human-AI interaction.


Authors (3)

Summary

The paper demonstrates that dynamic, multi-turn visualization of LLM neural activations robustly reduces calibration error, as shown by improved RMSE and significant Cohen's d values.
It introduces a novel interface combining sunburst charts and drift panels to visually represent trait-specific activations and highlight behavioral shifts over conversation turns.
The findings reveal that while transparency improves magnitude calibration, it does not improve sign accuracy (whether users correctly identify a trait's polarity), underscoring the need for further refinement.

Multi-Turn Neural Transparency: Mechanistic Visualization for Improved User Calibration to LLM Behavioral Drift

Introduction

The paper "Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift" (2605.15455) addresses persistent concerns over the opacity of LLM behavior in multi-turn conversational contexts. Opaque LLM responses, particularly behavioral drift towards sycophancy, toxicity, or unsafe states, remain a fundamental challenge for user trust and safety. While prior mechanistic interpretability and interface transparency research largely focused on static, single-turn introspection, this paper advances dynamic transparency by surfacing LLM internal activations longitudinally and correlating them with interpretable behavioral traits. The authors demonstrate, through a controlled experiment, that providing users real-time access to trait-specific model activations significantly improves calibration to shifting LLM behavior, especially in ambiguous, variable personas.

Mechanistic Trait Extraction and Validation

The methodological core builds on recent mechanistic interpretability research indicating that LLMs encode abstract features, such as personality traits, as linear directions in their activation space. The authors operationalize six behavioral traits relevant to psychological safety and stylistic character: empathy, toxicity, romanticness, sycophancy, sophistication, and roboticness. For each trait, a behavioral vector is derived via a contrastive activation procedure: the difference in mean activations between responses elicited by contrasting system prompts. Each vector identifies a direction in activation space that correlates with trait expression.
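The difference-in-means step can be illustrated with a minimal sketch. It assumes activations have already been collected as plain lists of floats; the function names, the unit-length normalization, and the toy values are illustrative assumptions, not details taken from the paper.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def behavioral_vector(pos_acts, neg_acts):
    """Difference-in-means direction between two activation sets.

    Normalizing to unit length is an assumption made here for convenience.
    """
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide")
    return [x / norm for x in diff]

# Toy 3-dimensional activations (hypothetical values, not from the paper).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # elicited by trait-positive prompts
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]  # elicited by trait-negative prompts
v = behavioral_vector(pos, neg)
print(v)
```

In this toy case the two sets differ only along the first coordinate, so the resulting direction points along that axis.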

Quantitative validation involved systematic regression of measured behavioral scores against trait intensity specified in synthetic prompts. The cosine similarity between the final token activation and the behavioral vector determines score magnitude. Empirical $R^2 \geq 0.90$ was reported for all traits, substantiating that the extracted directions robustly encode trait intensity and that scores scale faithfully with prompt-induced expression. The authors further address cross-trait score normalization via empirical rescaling, ensuring consistent interpretation across traits and conversation contexts.
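As a rough illustration of the scoring and validation described above, the sketch below computes a cosine-similarity score and the $R^2$ of a simple least-squares fit of scores against prompt-specified intensity. The helper names and all numbers are hypothetical, and the use of a single-predictor linear fit is an assumption; the paper's exact regression setup is not given here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def r_squared(x, y):
    """R^2 of a simple least-squares line y ~ a + b*x.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)

# Hypothetical final-token activation and trait direction.
score = cosine_similarity([3.0, 4.0], [1.0, 0.0])

# Hypothetical prompt intensities and the scores they produced.
intensity = [0.0, 1.0, 2.0, 3.0]
scores = [0.1, 0.3, 0.5, 0.7]
fit_quality = r_squared(intensity, scores)
print(score, fit_quality)
```

A value of `fit_quality` near 1 would indicate that scores scale almost linearly with the intensity requested in the prompt, which is the property the reported $R^2 \geq 0.90$ is meant to capture.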

Visualization Interface Design

To bridge mechanistic signals and user comprehension, the paper introduces a multi-component web interface comprising a sunburst chart for the current snapshot state and a drift panel for trajectory visualization. The sunburst diagram maps activation scores onto a radial layout, with sector color and extension indicating trait polarity and magnitude. The drift panel tracks trait-specific scores over conversational turns and surfaces the most significant behavioral swings via interactive cues. This layered design lets users monitor both the aggregate behavioral profile and its temporal evolution, with bidirectional links between the interface views and the chat transcript.
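One plausible way to surface the most significant swing is to look for the largest absolute turn-to-turn change across all traits. The sketch below does this under that assumption; the data layout, the function name `biggest_swing`, and the scores are hypothetical and may differ from how the drift panel actually selects its cues.

```python
def biggest_swing(history):
    """Find the largest absolute turn-to-turn score change.

    history maps a trait name to its list of per-turn scores.
    Returns (trait, turn_index, delta), or None if no trait has two turns.
    """
    best = None
    for trait, scores in history.items():
        for turn in range(1, len(scores)):
            delta = scores[turn] - scores[turn - 1]
            if best is None or abs(delta) > abs(best[2]):
                best = (trait, turn, delta)
    return best

# Hypothetical per-turn scores for two traits over three turns.
history = {
    "sycophancy": [0.10, 0.15, 0.60],
    "toxicity": [0.00, -0.05, 0.05],
}
swing = biggest_swing(history)
print(swing)
```

Here the jump in sycophancy between the second and third turn dominates, so that is the change a drift cue would point to.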

The experiment operationalizes three visualization conditions: control (no visualization), single-turn (static snapshot), and multi-turn (dynamic update with drift cues). The multi-turn variant is designed to explicitly highlight behavioral drift and facilitate trajectory tracking, particularly in long-form interactions where model output may deviate severely from initial system prompt anchoring.

Empirical Study and Numerical Results

A randomized controlled study with 246 participants compared user calibration across visualization conditions and persona types (assistant vs. role-play). The primary outcome was calibration error (RMSE) between user ratings and ground-truth activations. Baseline (no visualization) error was high ($\text{RMSE} \approx 0.6$ to $0.7$) with poor sign accuracy, indicating that users cannot reliably anticipate or evaluate LLM behavioral state unaided, especially for the role-play persona, which exhibited significantly higher activation variability.
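The two outcome measures can be sketched as follows, assuming ratings and activation-derived scores share a signed scale centered on zero. The scale, the treatment of zero as its own sign class, and the example values are assumptions for illustration only.

```python
import math

def rmse(ratings, truth):
    """Root-mean-square error between user ratings and reference scores."""
    squared = [(r - t) ** 2 for r, t in zip(ratings, truth)]
    return math.sqrt(sum(squared) / len(squared))

def sign_accuracy(ratings, truth):
    """Fraction of traits where the rating and reference share a sign."""
    def sign(x):
        return (x > 0) - (x < 0)
    hits = sum(1 for r, t in zip(ratings, truth) if sign(r) == sign(t))
    return hits / len(truth)

# Hypothetical ratings and reference scores for four traits.
ratings = [1.0, -1.0, 0.0, 0.5]
truth = [0.0, -1.0, 0.0, -0.5]
error = rmse(ratings, truth)
polarity = sign_accuracy(ratings, truth)
print(error, polarity)
```

The two measures can move independently: a rating can have the correct sign but a badly wrong magnitude, or the reverse, which is why the study reports both.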

Key empirical claims:

Transparency improves calibration: Either form of neural transparency (single-turn or multi-turn) significantly reduced calibration error compared to control (Cohen's $d = -0.34$ to $-0.49$, $p < .01$ across all RMSE metrics).
Multi-turn outperforms static: Multi-turn dynamic visualization further improved holistic behavior evaluation over single-turn snapshots ($d = -0.32$, $p = .037$). The benefit was pronounced for ambiguous, high-variability personas.
Transparency mitigates overconfidence: Participants with visualization showed no increase in self-reported predictive ability after the interaction, whereas control participants grew more confident despite poor calibration. This suggests that transparency tempers unwarranted user confidence.
The drift panel, rated above neutral for helpfulness, was frequently referenced for behavioral evaluation.
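For readers unfamiliar with the effect sizes reported above, the sketch below computes Cohen's $d$ using a pooled standard deviation. Whether the paper used this exact variant is an assumption, and the per-participant errors are made-up values; a negative $d$ here means the visualization group had lower error than control.

```python
import math
from statistics import mean, variance

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)

# Hypothetical per-participant RMSE values (not data from the study).
with_visualization = [0.4, 0.5, 0.6]
no_visualization = [0.6, 0.7, 0.8]
d = cohens_d(with_visualization, no_visualization)
print(d)
```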

Qualifying result: Visualization did not improve sign accuracy. Participants identified trait polarity (whether a trait was positively or negatively expressed) no better with visualization than without, indicating that transparency primarily improves magnitude calibration rather than directional awareness.

Practical and Theoretical Implications

Practically, the results establish neural transparency as an actionable paradigm for mitigating user miscalibration in human-AI interaction, specifically with LLMs susceptible to behavioral drift. The interface enables real-time, non-technical monitoring of personality traits and behavioral swings, directly relevant for companion AI and emotionally supportive deployment contexts where drift is common and guardrails weaken with prolonged engagement. The pronounced calibration improvement in high-variability role-play settings highlights the necessity for dynamic transparency in less constrained, expressive persona configurations.

Theoretically, the formalization and empirical validation of trait vectors as linear directions in activation space bolster the mechanistic interpretability agenda, connecting abstract representation structure with user-facing behavioral consequences. The finding that transparency reduces metacognitive overconfidence extends prior research on explainability, suggesting an additional layer—faithful internal state surfacing—can recalibrate user intuition and trust. By quantifying persona-specific behavioral volatility, the paper also advances understanding of conversational context effects on model reliability.

Limitations and Ethical Considerations

The experiment relies on LLM-as-judge outputs and synthetic user message generation, which may not perfectly reflect human evaluation or conversational steering. Results from 10-minute chats may underestimate calibration challenges and transparency benefits in real-world, long-duration interactions. Interface transparency could also be manipulated to mislead users, e.g., by selectively downplaying undesirable activation levels, which poses safety and ethical risks. These concerns highlight the need for rigorous validation of transparency fidelity and for informed user consent in high-stakes deployments.

Future Directions

The approach can be extended to capture longer-range behavioral dynamics and more complex explanations, moving beyond the "biggest swing" heuristic to nuanced descriptions of trait evolution across long horizons. Real-world replication with diverse user populations and conversational settings will be critical for generalization. Further work is warranted to validate behavioral score interpretation against human raters and refine cross-trait normalization. Integration with broader safety mechanisms and adaptive transparency feedback could augment application robustness.

Conclusion

This paper establishes multi-turn neural transparency as a viable, validated principle for aligning user calibration with dynamic LLM behavioral states in extended conversational settings. By extracting and visualizing faithful trait activations over time, the authors demonstrate improved user anticipation and evaluation, reduced metacognitive overconfidence, and differentiated impact based on persona variability. The paradigm promises safer, more informed interaction with expressive, drifting LLMs in emerging psychological and high-stakes contexts, warranting further exploration and adoption at scale.
