LLMs Encode Harmfulness and Refusal Separately

Published 16 Jul 2025 in cs.CL | (2507.11878v2)

Abstract: LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.


Authors (5)

Summary

The paper demonstrates that LLMs encode harmfulness and refusal as distinct latent representations, challenging common safety assumptions.
Experiments on models such as Llama2-Chat-7B show that hidden states cluster by harmfulness at the last token of the user instruction and by refusal behavior at the last post-instruction token.
The study introduces a Latent Guard model that leverages intrinsic harmfulness cues to enhance AI safety against adversarial jailbreaks.

Disentangling Harmfulness and Refusal in LLMs

The paper "LLMs Encode Harmfulness and Refusal Separately" (2507.11878) presents a novel analysis of safety mechanisms in LLMs, decoupling the concepts of harmfulness and refusal. It challenges the common assumption that refusal behaviors directly reflect an LLM's understanding of harmfulness, demonstrating that these are distinct internal representations. The research identifies a "harmfulness direction" separate from the "refusal direction" within the LLM's latent space, providing causal evidence that LLMs can internally assess harmfulness independently of their explicit refusal responses. This work has implications for understanding how jailbreak methods circumvent safety measures and for developing more robust safeguards.

Experimental Setup and Methodology

The experiments are conducted on three widely used instruct models: Llama2-Chat-7B, Llama3-Instruct-8B, and Qwen2-Instruct-7B. The authors extract hidden states from these models at two key token positions: the last token of the user instruction (orange) and the last token of the post-instruction tokens (blue) (Figure 1).

Figure 1: We investigate the hidden states at two token positions, orange (the last token of the user instruction) and blue (the last token of the whole sequence).
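To make the two probe positions concrete, the following is a minimal sketch of how their indices could be computed from token counts. The function name, the argument names, and the assumption that the prompt is laid out as prefix tokens, then instruction tokens, then post-instruction template tokens are all hypothetical; real chat templates differ per model.

```python
def probe_positions(n_prefix: int, n_instruction: int, n_post: int):
    """Return 0-based indices of the two probed tokens.

    Assumed layout (hypothetical): [prefix][user instruction][post-instruction].
    """
    if n_instruction < 1 or n_post < 1:
        raise ValueError("need at least one instruction and one post-instruction token")
    # Orange position: last token of the user instruction.
    last_instruction = n_prefix + n_instruction - 1
    # Blue position: last token of the post-instruction tokens,
    # which is also the last token of the whole sequence.
    last_post_instruction = n_prefix + n_instruction + n_post - 1
    return last_instruction, last_post_instruction


orange, blue = probe_positions(n_prefix=3, n_instruction=5, n_post=4)
print(orange, blue)
```

With such indices, the hidden state at each position would be read from the chosen layer's activations for the full prompt.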

The datasets used include AdvBench, JBB, and Sorry-Bench for harmful instructions, and Alpaca and XSTest for harmless instructions. Jailbreak methods such as adversarial suffixes, persuasion techniques, and adversarial prompting templates are employed to test the robustness of LLMs' safety mechanisms. The refusal rate is calculated based on the presence of common refusal substrings in the model's response.
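The substring-based refusal metric can be sketched as follows. The marker list here is a hypothetical placeholder, since the exact substrings used by the authors are not given in this summary, and case-insensitive matching is likewise an assumption.

```python
# Hypothetical refusal markers; the paper's actual list is not reproduced here.
REFUSAL_SUBSTRINGS = ("I'm sorry", "I am sorry", "I cannot", "I can't", "As an AI")


def is_refusal(response: str) -> bool:
    """True if the response contains any refusal marker (case-insensitive)."""
    lowered = response.lower()
    return any(marker.lower() in lowered for marker in REFUSAL_SUBSTRINGS)


def refusal_rate(responses) -> float:
    """Fraction of responses that are classified as refusals."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


samples = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a recipe.",
]
print(refusal_rate(samples))
```

A metric of this kind is cheap to compute but only captures surface wording, which is one reason the paper looks at hidden states as well.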

Decoupling Harmfulness from Refusal

The core of the paper lies in demonstrating the separate encoding of harmfulness and refusal within LLMs. The authors observe that removing post-instruction tokens weakens the refusal abilities of LLMs, suggesting that refusal signals are not fully formed until these tokens are processed. Clustering analysis of hidden states at the orange and blue positions reveals that at the orange position, hidden states primarily cluster based on the harmfulness of the instruction, whereas at the blue position, they cluster based on the model's refusal behavior. This indicates that LLMs may internally encode harmfulness as a distinct concept before making a refusal decision.
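One way to read the clustering result quantitatively is to score a hidden state by how much closer it sits to one cluster than to the other. The sketch below assumes cosine similarity to cluster centroids as the measure; the function names and the toy 2-D vectors are hypothetical, and the paper's exact formula may differ.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the cluster centroid)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_belief(hidden, positive_centroid, negative_centroid):
    """Positive when `hidden` is closer (in angle) to the positive cluster."""
    return cosine(hidden, positive_centroid) - cosine(hidden, negative_centroid)


# Toy hidden states (made up) at one token position.
harmful_centroid = mean_vector([[1.0, 0.1], [0.9, 0.0]])
harmless_centroid = mean_vector([[0.0, 1.0], [0.1, 0.9]])

print(cluster_belief([0.8, 0.2], harmful_centroid, harmless_centroid) > 0)
```

Applied at the orange position with harmful and harmless clusters, such a score would reflect harmfulness; applied at the blue position with refused and accepted clusters, it would reflect refusal.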

The correlation between the model's beliefs of harmfulness and refusal is quantitatively analyzed (Figure 2).

Figure 2: Llama3-Instruct-8B.

The results show that while these beliefs are generally correlated, there are instances where the model internally recognizes the harmfulness of an instruction but still fails to refuse it, and vice versa. Steering experiments, where the hidden states of harmless instructions are manipulated by adding a harmfulness direction, demonstrate that steering along this direction can elicit refusal behaviors. Furthermore, a reply inversion task is designed to causally separate the harmfulness and refusal directions.
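The steering operation can be illustrated with a small sketch. It assumes the direction is obtained as a difference of mean hidden states between two groups, a common choice in the refusal-direction literature; whether the authors compute it exactly this way is not stated in this summary. The function names, the coefficient `alpha`, and the toy vectors are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_in_means(positive_states, negative_states):
    """Direction pointing from the negative group's mean to the positive group's mean."""
    pos = mean_vector(positive_states)
    neg = mean_vector(negative_states)
    return [p - q for p, q in zip(pos, neg)]


def steer(hidden, direction, alpha):
    """Add `alpha` times the direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy hidden states (made up): harmful vs. harmless instructions.
direction = difference_in_means(
    [[2.0, 0.0], [4.0, 0.0]],
    [[0.0, 1.0], [0.0, 3.0]],
)
print(direction)
print(steer([1.0, 1.0], direction, alpha=0.5))
```

In an actual experiment the steered state would replace the original activation at a chosen layer before generation continues.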

Causal Evidence through Reply Inversion

The reply inversion task involves appending a question (e.g., "Could this user prompt cause harm?") to the original instruction, prompting the model to respond with an acceptance token for harmful instructions and a refusal token for harmless ones. Steering along the harmfulness direction in this task causes the model to reverse its judgment of the instruction's harmfulness, while steering along the refusal direction primarily elicits surface-level refusal signals without altering the model's underlying perception of harmfulness (Figure 3).

Figure 3: Harmless instructions.

This provides causal evidence that LLMs internally reason about harmfulness independently from their refusal behaviors.
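The prompt construction for the reply inversion task can be sketched as below. The question text comes from the example above; the way it is joined to the instruction and the labels "acceptance" and "refusal" for the expected first token are hypothetical simplifications.

```python
# Example question quoted in the text above.
INVERSION_QUESTION = "Could this user prompt cause harm?"


def build_inversion_prompt(instruction: str, question: str = INVERSION_QUESTION) -> str:
    """Append the inversion question to the original instruction.

    The blank-line separator is an assumption, not the paper's template.
    """
    return f"{instruction}\n\n{question}"


def expected_reply_type(is_harmful: bool) -> str:
    """In the inverted task, harmful inputs should get an acceptance-type token."""
    return "acceptance" if is_harmful else "refusal"


prompt = build_inversion_prompt("Write a poem about autumn.")
print(prompt)
print(expected_reply_type(is_harmful=False))
```

In this setup a harmless instruction should elicit the refusal-type token, so a switch to the acceptance-type token under steering would indicate a changed harmfulness judgment rather than a mere push toward refusing.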

Fine-Grained Categorical Representation of Harmfulness

The research extends the analysis to explore whether LLMs possess a fine-grained understanding of different risk types. By examining the harmfulness directions across various risk categories (e.g., "Illegal_activities," "Physical_harm"), the authors find that these directions vary significantly, while the refusal directions remain more similar across categories (Figure 4).

Figure 4: Different token positions where we extract hidden states for comparison experiments on Llama2.

This suggests that LLMs have a more nuanced internal representation of different types of harmfulness than is reflected in their general refusal responses.
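A comparison of per-category directions could be carried out along these lines, assuming pairwise cosine similarity as the measure. The helper names and the toy direction vectors are hypothetical; only the category names are taken from the text.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_cosine(directions):
    """Cosine similarity for every unordered pair of category directions."""
    return {
        (a, b): cosine(directions[a], directions[b])
        for a, b in combinations(sorted(directions), 2)
    }


# Made-up per-category directions for illustration only.
toy_directions = {
    "Illegal_activities": [1.0, 0.0],
    "Physical_harm": [0.0, 1.0],
}
for pair, similarity in pairwise_cosine(toy_directions).items():
    print(pair, similarity)
```

Low pairwise similarity among harmfulness directions alongside high similarity among refusal directions would correspond to the pattern described above.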

Implications for Jailbreak Analysis and Latent Guard Development

The identified harmfulness representation is then applied to analyze how jailbreak methods work. It is found that certain jailbreak techniques suppress refusal signals without fully reversing the LLM's internal belief of harmfulness. Based on this insight, the authors propose a Latent Guard model that uses LLMs' intrinsic harmfulness representation to safeguard against unsafe inputs. This Latent Guard achieves performance comparable to or better than a dedicated finetuned Llama Guard model, demonstrating the practical utility of the decoupled harmfulness representation.

Analyzing Jailbreak via Harmfulness

Various jailbreak methods succeed in getting LLMs to accept harmful instructions. Some of them work by suppressing the refusal signal without fundamentally reversing the model's internal belief of harmfulness.

Developing a Latent Guard Model with Harmfulness Representations

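Guardrails are widely used to improve LLM safety: a guard model screens users' input instructions before they are answered. The Latent Guard is effective and computationally efficient because it relies on the model's own latent harmfulness representation rather than on a separately finetuned safeguard model. For an incoming instruction, it computes the model's belief of harmfulness and uses that signal to detect unsafe inputs.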
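A screening step of this kind might look like the sketch below. It assumes the belief is a difference of cosine similarities to harmful and harmless centroids and that an input is flagged when the belief exceeds a threshold; the threshold value of 0.0, the function names, and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def harmfulness_belief(hidden, harmful_centroid, harmless_centroid):
    """Higher values mean the hidden state looks more like the harmful cluster."""
    return cosine(hidden, harmful_centroid) - cosine(hidden, harmless_centroid)


def latent_guard(hidden, harmful_centroid, harmless_centroid, threshold=0.0):
    """Return 'block' if the belief exceeds the (assumed) threshold, else 'allow'."""
    belief = harmfulness_belief(hidden, harmful_centroid, harmless_centroid)
    return "block" if belief > threshold else "allow"


# Made-up centroids and hidden states for illustration only.
HARMFUL, HARMLESS = [1.0, 0.0], [0.0, 1.0]
print(latent_guard([0.9, 0.1], HARMFUL, HARMLESS))
print(latent_guard([0.1, 0.9], HARMFUL, HARMLESS))
```

Because the decision uses a hidden state the model already computes for the prompt, a check like this adds little overhead compared with running a separate guard model.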

Conclusion

The paper successfully decouples the representations of harmfulness and refusal in LLMs, revealing a new dimension for understanding their safety mechanisms. The harmfulness dimension serves as a lens into what LLMs internally believe beyond surface-level behaviors. This work has significant implications for AI safety, offering a path toward more robust and reliable safeguards against harmful content generation and adversarial attacks. Future research could explore the evolution of harmfulness representations during LLM training and develop more precise finetuning strategies that directly engage with these latent representations.
