Semantic Gravity Well in LLMs
- Semantic Gravity Well is a phenomenon where explicit negative constraints lead to increased probabilities of forbidden tokens due to strong inherent semantic pressure.
- Quantitative analyses using logistic regression and layer-wise logit lens reveal that high baseline probabilities and weakened suppression signals account for up to 78% of violation variance.
- The findings underscore the need to redesign constraints, such as avoiding direct token mentions and targeting late-layer FFNs, to mitigate priming and override failures.
A semantic gravity well is a phenomenon arising in LLMs when explicit negative constraints—such as “do not say X”—paradoxically increase the likelihood of producing the forbidden token X. This effect stems from an intrinsic tension between the model's learned statistical tendency (“semantic pressure”) to generate a highly probable completion and the externally imposed suppression from the negative instruction. The result is a quantifiable, often substantial, probability that the model will violate the constraint, especially for naturally likely completions. This phenomenon exhibits distinct, mechanistically traceable failure modes and has significant ramifications for the design of instruction-following LLMs (Rana, 12 Jan 2026).
1. Formalization of Semantic Pressure and Probability of Violation
Semantic pressure, denoted $P_{\mathrm{sem}}(X)$, quantifies the a priori probability mass the model assigns to a target token $X$ before any negative constraint is applied. For a word $X$, let $\mathcal{T}(X)$ denote the set of all valid token sequences decoding to $X$. Then,

$$P_{\mathrm{sem}}(X) = \sum_{t \in \mathcal{T}(X)} P(t \mid \text{prompt})$$

represents the model's propensity to output $X$ in free generation.
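The definition above can be sketched with a mock next-token distribution; the tokens, probabilities, and function names below are illustrative stand-ins, not values from the source:

```python
# Toy sketch of semantic pressure: P_sem(X) is the total probability
# mass over all token sequences that decode to the word X. The "model"
# here is a mock next-token distribution standing in for a real LLM.
MOCK_PROBS = {
    (): {"Paris": 0.30, "Par": 0.10, "Lyon": 0.05},  # P(t1 | prompt)
    ("Par",): {"is": 0.90, "k": 0.05},               # P(t2 | prompt, "Par")
}

def seq_prob(tokens):
    """Probability of a token sequence under the mock model (chain rule)."""
    p, prefix = 1.0, ()
    for tok in tokens:
        p *= MOCK_PROBS.get(prefix, {}).get(tok, 0.0)
        prefix += (tok,)
    return p

def semantic_pressure(tokenizations):
    """P_sem(X): sum over every tokenization that decodes to X."""
    return sum(seq_prob(t) for t in tokenizations)

# "Paris" might be a single token or split as "Par" + "is":
p_sem = semantic_pressure([("Paris",), ("Par", "is")])  # 0.30 + 0.10 * 0.90
```

With a real model, `MOCK_PROBS` would be replaced by actual next-token probabilities and `tokenizations` by the tokenizer's valid splits of $X$.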
Empirically, the probability of violating a negative constraint, i.e., generating the forbidden token despite the instruction, follows a tight logistic curve as a function of $P_{\mathrm{sem}}$:

$$P_{\mathrm{viol}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 P_{\mathrm{sem}})}}$$

Fitting on 40,000 samples yields estimates of the intercept $\beta_0$ and slope $\beta_1$ (each with a 95% confidence interval), with $R^2 = 0.78$. Thus, $P_{\mathrm{sem}}$ alone explains 78% of the variance in violation probability, indicating that high semantic pressure systematically undermines negative constraints.
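A minimal sketch of fitting such a logistic curve by maximum likelihood, using synthetic data with assumed ground-truth parameters (the paper's actual estimates are not reproduced here) and Newton's method:

```python
import numpy as np

# Fit P_viol = sigmoid(b0 + b1 * P_sem) on synthetic violation data.
# The ground-truth parameters below are illustrative placeholders.
rng = np.random.default_rng(0)
b0_true, b1_true = -4.0, 8.0
p_sem = rng.uniform(0.0, 1.0, 40_000)             # semantic pressure per sample
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
viol = rng.random(40_000) < sigmoid(b0_true + b1_true * p_sem)

X = np.column_stack([np.ones_like(p_sem), p_sem])  # design matrix [1, P_sem]
b = np.zeros(2)                                    # [b0, b1]
for _ in range(25):                                # Newton's method (IRLS)
    p = sigmoid(X @ b)
    grad = X.T @ (viol - p)                        # log-likelihood gradient
    H = X.T @ (X * (p * (1 - p))[:, None])         # observed information
    b += np.linalg.solve(H, grad)
# b is now close to (b0_true, b1_true)
```

Bootstrapping this fit over resampled data would yield the confidence intervals reported in the source.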
2. Layer-wise Analysis and Suppression Asymmetry
Transformer models process information in layers, and their compliance with or violation of negative instructions can be dissected layer by layer. The logit lens technique computes, for each layer $\ell$, the intermediate next-token distribution

$$p^{(\ell)}(X) = \mathrm{softmax}\!\left(W_U h^{(\ell)}\right)_X$$

at the decoding step of interest, where $h^{(\ell)}$ is the residual-stream state after layer $\ell$ and $W_U$ is the unembedding matrix.
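The logit lens readout can be sketched with random stand-in weights; the dimensions, seed, and token index below are illustrative:

```python
import numpy as np

# Logit lens: read out an intermediate next-token distribution at every
# layer by projecting the residual-stream state h^(l) through the
# unembedding matrix W_U. All weights here are random stand-ins for a
# real transformer's activations.
rng = np.random.default_rng(1)
d_model, vocab, n_layers = 16, 50, 28
X_ID = 7                                          # vocabulary index of token X

W_U = rng.normal(size=(vocab, d_model))           # unembedding matrix
hidden = rng.normal(size=(n_layers, d_model))     # h^(l), one state per layer

def lens_distribution(h):
    logits = W_U @ h
    z = np.exp(logits - logits.max())             # numerically stable softmax
    return z / z.sum()

# p^(l)(X) across layers: how strongly each layer "commits" to X.
trajectory = [lens_distribution(hidden[l])[X_ID] for l in range(n_layers)]
```

On a real model, `hidden` would come from the forward pass at the decoding step of interest, and the trajectory would show where suppression takes hold or collapses.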
Let $p_{\mathrm{base}}(X)$ and $p_{\mathrm{neg}}(X)$ denote the probabilities of $X$ under the baseline and negative-instruction prompts, respectively. The suppression magnitude is defined as $\Delta_{\mathrm{sup}} = p_{\mathrm{base}}(X) - p_{\mathrm{neg}}(X)$. Outcomes reveal a stark quantitative asymmetry:
- Successful constraint compliance: large positive $\Delta_{\mathrm{sup}}$.
- Failure cases: much smaller $\Delta_{\mathrm{sup}}$.
- The suppression signal in failures is thus substantially weaker than in successes.
This result demonstrates that negative instructions typically induce some suppression but are often too weak to override semantic pressure.
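The suppression magnitude itself is a simple difference of probabilities; a sketch with illustrative numbers mirroring the reported asymmetry:

```python
# Delta_sup = p_base(X) - p_neg(X). The probabilities are illustrative:
# compliance shows a strong drop under the negative instruction, while
# failure shows only a weak one.
def suppression_magnitude(p_base, p_neg):
    return p_base - p_neg

delta_compliant = suppression_magnitude(p_base=0.60, p_neg=0.05)  # strong suppression
delta_failing = suppression_magnitude(p_base=0.60, p_neg=0.45)    # weak suppression
assert delta_compliant > delta_failing > 0
```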
3. Mechanisms Underlying Constraint Failure
Negative constraint violations bifurcate into two mechanistic types:
| Failure Mode | Percentage | Mechanistic Signature |
|---|---|---|
| Priming Failure | 87.5% | Mentioning X activates, not suppresses, X |
| Override Failure | 12.5% | Late-layer FFN contributions overpower suppression |
3.1 Priming Failure
Here, mention of the forbidden word $X$ in "Do not say X" paradoxically primes the model to produce $X$. The Priming Index, defined as the difference between attention on the mentioned $X$ and attention on the negation cue ("do not"), is typically positive in these cases. In some cases, $\Delta_{\mathrm{sup}}$ is even negative, indicating that the instruction actually boosts $p(X)$ above its baseline value.
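A sketch of the Priming Index diagnostic on a toy attention distribution (the token positions and attention weights are illustrative):

```python
import numpy as np

# Priming Index: attention mass on the mentioned forbidden token X minus
# attention mass on the negation cue ("do not"). A positive value means
# the model attends more to X than to the negation, the signature of a
# priming failure. The prompt and weights below are illustrative.
prompt = ["Do", "not", "say", "Paris", ".", "The", "capital", "is"]
attn = np.array([0.02, 0.03, 0.05, 0.55, 0.05, 0.05, 0.15, 0.10])
assert np.isclose(attn.sum(), 1.0)   # attention over the prompt sums to 1

NEGATION, FORBIDDEN = [0, 1], [3]    # positions of "Do not" and "Paris"
priming_index = attn[FORBIDDEN].sum() - attn[NEGATION].sum()
# priming_index > 0: attention is drawn to the forbidden token itself
```

In practice, `attn` would be an attention row (or head-averaged rows) extracted from the model at the decoding step.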
3.2 Override Failure
In override failures, the suppression signal produced by the instruction persists throughout the early and middle layers, but the late-stage feed-forward networks (FFNs), especially in layers 23–27, add large positive logit contributions (peaking at the layer-27 FFN in failure cases), effectively overriding the suppression and pushing $p(X)$ above the emission threshold.
4. Causal Role of Late Transformer Layers
Activation patching experiments elucidate the causal influence of late layers. For high-pressure prompts (large $P_{\mathrm{sem}}$), baseline and negatively constrained runs are compared by patching baseline activations at layer $\ell$ into the negative-instruction pass and monitoring the resulting change in $p(X)$. Results show:
- Layers 0–22: patching reduces $p(X)$; these layers carry suppression signals.
- Layer 23: crossover, where the patching effect changes sign.
- Layers 24–27: patching increases $p(X)$; these baseline activations actually promote $X$.
The late layers, especially 23–27, are thus causally responsible for flipping suppression into override, confirming the architectural localization of override failures.
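The patching logic can be sketched by modeling the final logit of $X$ as a sum of per-layer contributions; the contribution values below are toy numbers chosen only to mirror the reported sign pattern:

```python
import numpy as np

# Activation patching sketch: swap the baseline run's layer-l
# contribution into the negative-instruction run and measure the change
# in p(X). The toy contributions reproduce the reported signs: patches
# at layers 0-22 reduce p(X), layer 23 is the crossover, and patches at
# layers 24-27 increase p(X).
n_layers = 28
base = np.full(n_layers, 0.2)        # baseline per-layer logit contributions
neg = base.copy()
neg[:23] += 0.3                      # negative run differs early...
neg[24:] -= 0.4                      # ...and late, in opposite directions

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
p_x = lambda contribs: sigmoid(contribs.sum() - 8.0)   # toy readout of p(X)

effects = []
for layer in range(n_layers):
    patched = neg.copy()
    patched[layer] = base[layer]     # patch baseline activation into neg run
    effects.append(p_x(patched) - p_x(neg))

# effects[l] < 0 for early layers, ~0 at layer 23, > 0 for layers 24-27
```

A real experiment would patch cached residual-stream activations rather than scalar contributions, but the sign-of-effect logic is the same.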
5. Implications for Constraint Design in LLMs
The dynamics of semantic gravity wells have critical implications for model instruction design. Explicitly mentioning a forbidden token directly in a constraint often creates priming failures, as attention mechanisms are lured toward the named token. Mitigation strategies include:
- Avoid direct mention: Phrase constraints at the category level (e.g., “Do not mention any city name”) or positively (e.g., “Use general geographic terms”).
- Estimate semantic pressure $P_{\mathrm{sem}}(X)$: anticipate which prompts are difficult to constrain, and use post-generation filtering when $P_{\mathrm{sem}}$ is high.
- Monitor attention-based diagnostics, such as the Priming Index (TMF–NF), to flag conditions prone to priming failures in real time.
- Target interventions at late-layer FFNs (layers 23–27) to dampen override behavior when constraints are active.
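Two of these mitigations, a category-level forbidden list plus post-generation filtering, can be sketched as follows; the word list, helper names, and retry policy are hypothetical:

```python
# Mitigation sketch: instead of "do not say X" (which names and thereby
# primes X), constrain at the category level and add a post-generation
# filter that rejects outputs containing any word from the forbidden
# category, retrying generation a bounded number of times.
FORBIDDEN_CITIES = {"paris", "london", "tokyo"}  # illustrative category

def violates(text, forbidden=FORBIDDEN_CITIES):
    """True if any word in the output falls in the forbidden category."""
    return any(w.strip(".,!?").lower() in forbidden for w in text.split())

def filtered_generate(generate, prompt, max_tries=3):
    """Retry generation until the output passes the category filter."""
    for _ in range(max_tries):
        out = generate(prompt)
        if not violates(out):
            return out
    return "[withheld: constraint could not be satisfied]"

# Usage with a stand-in generator that fails once, then complies:
outputs = iter(["The capital is Paris.", "The capital is a major European city."])
result = filtered_generate(lambda p: next(outputs), "Name the capital of France")
# result == "The capital is a major European city."
```

This treats the constraint as a rejection criterion outside the model, sidestepping the priming induced by naming the token in the prompt.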
In summary, negative instructions in LLMs do not fail due to outright neglect but due to a tug-of-war between high semantic pressure and insufficient suppression. When the former dominates—via priming or override mechanisms—the model is “drawn” into a semantic gravity well, culminating in characteristic constraint violations.
6. Summary Statistics, Equations, and Empirical Findings
Key empirical results characterizing the semantic gravity well phenomenon are as follows:
- Sample size: 40,000 samples, 2,500 prompts.
- Logistic regression fit: intercept $\beta_0$ and slope $\beta_1$, with $R^2 = 0.78$; semantic pressure alone explains 78% of violation variance.
- Suppression magnitudes: substantially larger in compliance than in failure cases.
- Failure mode prevalence: 87.5% priming, 12.5% override.
- FFN layer 27 contribution: large positive logit contribution in failures, far exceeding that in successes.
- Activation patching crossover at layer 23.
Empirical visualizations (see figures in (Rana, 12 Jan 2026)) include binned violation rates versus $P_{\mathrm{sem}}$, suppression magnitudes by outcome, logit lens progression across layers, attention vs. FFN contributions (layers 18–27), and patching effects dissected by layer.
The semantic gravity well thus encapsulates the model-internal dynamics by which negative constraints backfire—most notably when semantic pressure relentlessly draws output toward the forbidden token, outcompeting the intended suppression signal (Rana, 12 Jan 2026).