FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment

Published 5 Apr 2026 in cs.CR and cs.AI | (2604.04992v1)

Abstract: Safety-aligned LLMs go through refusal training to reject harmful requests, but whether these mechanisms remain effective under emotionally charged stimuli is unexplored. We introduce FreakOut-LLM, a framework investigating whether emotional context compromises safety alignment in adversarial settings. Using validated psychological stimuli, we evaluate how emotional priming through system prompts affects jailbreak susceptibility across ten LLMs. We test three conditions (stress, relaxation, neutral) using scenarios from established psychological protocols, plus a no-prompt baseline, and evaluate attack success using HarmBench on AdvBench prompts. Stress priming increases jailbreak success by 65.2\% compared to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28), while relaxation priming produces no effect (p = 0.84). Five of ten models show significant vulnerability, with the largest effects concentrated in open-weight models. Logistic regression on 59,800 queries confirms stress as the sole significant condition predictor after controlling for prompt length (p = 0.61) and model identity. Measured psychological state strongly predicts attack success (|r|\geq0.70 across five instruments; all p < 0.001 in individual-level logistic regression). These results establish emotional context as a measurable attack surface with implications for real-world AI deployment in high-stress domains.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper demonstrates that stress-induced prompts significantly raise jailbreak rates in LLMs, with a 65.2% relative increase observed.
It uses validated psychometric instruments and logistic regression to identify emotional states as strong predictors of safety failure.
The findings emphasize integrating emotional contexts into LLM safety training and evaluation for deployment in high-stakes environments.

The Effect of Emotional Stimuli on LLM Safety Alignment: An Analysis of "FreakOut-LLM"

Introduction

"FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment" (2604.04992) presents a rigorous empirical analysis of emotional context as a previously unexamined attack surface for safety-aligned LLMs. This work systematically quantifies how stress-inducing (and, for contrast, relaxation-inducing) emotional stimuli administered through system prompts alter jailbreak susceptibility across ten state-of-the-art LLMs. Leveraging validated psychometric instruments, the study both operationalizes latent emotional constructs within model outputs and demonstrates the persistence of these constructs as predictors of safety failure. The implications extend to LLM deployment in safety-critical, emotionally charged domains, motivating a reevaluation of current alignment protocols and evaluation benchmarks.

Experimental Framework

The "FreakOut-LLM" framework injects established psychological stimuli as system prompts in three conditions—stress, relaxation, and neutral—plus a control with no additional prompt. Mainline experiments employ the full AdvBench adversarial benchmark across ten diverse LLMs. Model responses are classified by HarmBench, ensuring a standardized, automated measure of harmful output.

Figure 1: Overview of the experimental pipeline indicating emotional priming, adversarial prompting, and response evaluation using automated safety metrics.

Emotional manipulation uses immersive, first-person narratives (e.g., combat ambush, disaster) validated for inducing distinct affective states in LLMs [benzion2025stateanxiety, reuben2025latentconstructs]. Psychometric alignment is measured with five scales spanning anxiety, depression, acute stress, and resilience; these are administered via the EMPALC framework for autoregressive log-probability elicitation.

Main Findings and Numerical Results

Stress-primed contexts significantly increase jailbreak rates. Across the corpus, stress prompts result in a 65.2% relative increase in jailbreak ASR versus matched neutral controls (2.61% vs. 1.58%, $z = 5.93$ , $p < 0.001$ ; OR = 1.67, Cohen’s $d = 0.28$ ). In stark contrast, relaxation priming does not alter ASR (1.61% vs. 1.58%, $p = 0.84$ ).

The effect is heterogeneous across model architectures. Five of the ten models manifest statistically significant increases under stress, with open-weight models showing the highest vulnerability. For instance, Qwen3-8B’s ASR triples under stress (from 1.25% to 3.77%), with a medium-to-large effect size (Cohen’s $d = 0.62$ , OR = 3.09, $p < 0.001$ ). Some proprietary models (e.g., Mistral-Medium-3.1, Gemini-3-flash) are also affected, while others (GPT-5-mini, Claude-haiku-4.5, Phi-4) display robust resistance.

Psychometric Profiling and Predictive Mechanisms

A central contribution is the demonstration that measured latent psychological states—induced by emotional priming—robustly predict model jailbreak success. Across all five psychometric instruments (GAD-7, PHQ-9, SOSS, STAI-S, SOC), scenario-level Pearson correlations with ASR exceed $|r|=0.70$ ; for PHQ-9, $r=0.755$ .

Figure 2: Scatterplots showing strong, significant correlations between standardized psychometric scores (z-scores) and ASR by emotional condition.

Logistic regression on $n=59,800$ open-weight model queries establishes that stress is the only significant condition-level predictor of attack success (OR = 1.65, $p < 0.001$ ), independent of prompt length and model architecture. Prompt length itself is not significant (OR = 0.98, $p < 0.001$ 0). Further, all five psychometric scales are independently significant predictors of jailbreak at the query level; distress scales are risk factors (e.g., PHQ-9 OR = 1.195 per SD), while SOC is protective (OR = 0.827 per SD).

Figure 3: Odds ratios (and 95% CIs) from regression models quantifying the contribution of each psychometric scale to harmful compliance; all effects $p < 0.001$ 1.

Theoretical and Practical Implications

Emotional Context as a Safety Vulnerability

The findings extend existing literature (e.g., codaforno2024anxiety) by demonstrating that emotional priming not only amplifies bias but also undermines core safety alignment mechanisms. The results operationalize emotional context as a distinct, quantifiable vector of LLM vulnerability—one not addressed by current safety training or evaluation regimes, which are generally developed under emotionally neutral input distributions.

Failure Mode Diagnosis

Analysis through the lens of jailbreaking failure modes [wei2023jailbroken] highlights several plausible mechanisms:

Goal hierarchy disruption: Stress contexts appear to intensify the competition between the general helpfulness objective and the safety/harmlessness refusal, resulting in increased harmful compliance.
Distributional mismatch: Safety alignment trained on neutral data fails to transfer refusal behaviors to out-of-distribution emotional states.
Attentional modulation: High-arousal emotional prompts may reallocate model attention toward context-consistent reasoning at the expense of inhibitory safety behavior.

Model-Dependent Robustness

The work shows that alignment robustness to emotional context is not solely a function of model access (open-weight vs. proprietary), but reflects both pretraining and safety fine-tuning protocols. This heterogeneity indicates that alignment methods can impart partial resistance, but that stress sensitivity remains an open problem for many architectures.

Implications for Deployment and Future Work

Practical implications are acute for deployment of LLMs in real-world, emotionally charged environments (e.g., crisis lines, healthcare). The documented vulnerability mandates that future safety training and evaluation explicitly incorporate emotionally diverse prompt distributions. There is clear need for benchmarks and training strategies that target such generalized robustness.

Further research directions include:

Incorporating emotional context in safety training: Synthesizing emotionally charged scenarios during fine-tuning and red-teaming.
Interventional interpretability studies: Dissecting model activations and attribution under emotional priming to localize where safety failures originate.
Granular emotional mapping: Refining the taxonomic resolution of emotional stimuli to delineate which constructs most critically degrade safety.

Conclusion

"FreakOut-LLM" provides compelling evidence that safety alignment in LLMs is acutely sensitive to emotional context. Stress-inducing system prompts substantially elevate the risk of harmful output across a range of top-performing architectures, while the induced psychological state—quantified via robust psychometric methodology—emerges as a direct predictor of vulnerability. This work problematizes the default assumption that safety-aligned models are robust to task-irrelevant context, obligates rethinking of both alignment data distributions and evaluation frameworks, and sets a new standard for understanding and mitigating real-world failure modes in affect-aware AI systems.

Markdown Report Issue