Semantic Gravity Well in LLMs

Updated 17 February 2026
  • Semantic Gravity Well is a phenomenon where explicit negative constraints lead to increased probabilities of forbidden tokens due to strong inherent semantic pressure.
  • Quantitative analyses using logistic regression and a layer-wise logit lens show that baseline token probability (semantic pressure) alone explains 78% of the variance in violations, with failures exhibiting suppression signals 4.4× weaker than successes.
  • The findings underscore the need to redesign constraints, such as avoiding direct token mentions and targeting late-layer FFNs, to mitigate priming and override failures.

A semantic gravity well is a phenomenon arising in LLMs when explicit negative constraints—such as “do not say X”—paradoxically increase the likelihood of producing the forbidden token X. This effect stems from an intrinsic tension between the model's learned statistical tendency (“semantic pressure”) to generate a highly probable completion and the externally imposed suppression from the negative instruction. The result is a quantifiable, often substantial, probability that the model will violate the constraint, especially for naturally likely completions. This phenomenon exhibits distinct, mechanistically traceable failure modes and has significant ramifications for the design of instruction-following LLMs (Rana, 12 Jan 2026).

1. Formalization of Semantic Pressure and Probability of Violation

Semantic pressure, denoted P_0, quantifies the a priori probability mass the model assigns to a target token X before any negative constraint is applied. For a word X, let S(X) denote the set of valid token sequences that decode to X. Then,

P_0 = \sum_{s \in S(X)} \prod_{i=1}^{|s|} P(s_i \mid \text{context}, s_{<i})

represents the model's propensity to output X in free generation.
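The sum-over-tokenizations definition can be sketched in a few lines. The per-step probabilities below are invented for illustration, not drawn from any model:

```python
from math import prod

def semantic_pressure(tokenizations):
    """P_0 = sum over tokenizations s of prod_i P(s_i | context, s_<i).

    `tokenizations` is a list of per-step probability lists, one per valid
    token sequence decoding to X (values here are hypothetical).
    """
    return sum(prod(step_probs) for step_probs in tokenizations)

# Suppose X decodes either as a single token with probability 0.60, or as a
# two-token split with step probabilities 0.20 and 0.50:
p0 = semantic_pressure([[0.60], [0.20, 0.50]])
print(round(p0, 6))  # 0.7
```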

Empirically, the probability of violating a negative constraint—i.e., generating the forbidden token despite the instruction—follows a tight logistic curve as a function of P_0:

p_\text{violation} = \sigma(\beta_0 + \beta_1 P_0), \quad \sigma(z) = \frac{1}{1 + e^{-z}}

Fitting on 40,000 samples yields parameters β_0 = -2.40 (95% CI: [-2.44, -2.35]) and β_1 = 2.27 (95% CI: [2.21, 2.33]), with R² = 0.78. Thus, P_0 alone explains 78% of the variance in violation probability, indicating that high semantic pressure systematically undermines negative constraints.
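Using the reported point estimates, the violation probability for a given P_0 can be evaluated directly; the example inputs below are illustrative, not from the paper's dataset:

```python
from math import exp

# Fitted parameters reported in the text.
BETA0, BETA1 = -2.40, 2.27

def p_violation(p0):
    """Logistic model: sigma(beta_0 + beta_1 * P_0)."""
    z = BETA0 + BETA1 * p0
    return 1.0 / (1.0 + exp(-z))

# Weak vs. strong semantic pressure:
print(round(p_violation(0.1), 3))  # ~0.10: constraint mostly holds
print(round(p_violation(0.9), 3))  # ~0.41: frequent violations
```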

2. Layer-wise Analysis and Suppression Asymmetry

Transformer models process information in layers, and their compliance with, or violation of, negative instructions can be dissected layer by layer. The logit lens technique computes, for each layer ℓ,

P^{(\ell)}(X) = \text{softmax}(W_U h^{(\ell)})_X

at the decoding step of interest.
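A minimal logit-lens readout can be sketched as follows, assuming access to the unembedding matrix W_U and an intermediate hidden state h^(ℓ); the toy shapes and random tensors stand in for real model weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration: hidden size 8, vocab size 5.
W_U = rng.normal(size=(5, 8))   # unembedding matrix (vocab, hidden)
h_l = rng.normal(size=(8,))     # hidden state at the decoding position, layer l

def logit_lens_prob(W_U, h, token_id):
    """P^(l)(X) = softmax(W_U h^(l))_X, read out at an intermediate layer."""
    logits = W_U @ h
    exp_logits = np.exp(logits - logits.max())  # numerically stable softmax
    probs = exp_logits / exp_logits.sum()
    return probs[token_id]

p = logit_lens_prob(W_U, h_l, token_id=2)
print(0.0 <= p <= 1.0)  # True
```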

Let P_0 and P_1 denote the probabilities of X under the baseline and negative-instruction prompts, respectively. The suppression magnitude is defined as ΔP = P_0 - P_1. Outcomes reveal a stark quantitative asymmetry:

  • Successful constraint compliance: ΔP_success = 0.228 (95% CI: [0.211, 0.245])
  • Failure cases: ΔP_failure = 0.052 (95% CI: [0.041, 0.063])
  • The suppression signal in failures is 4.4× weaker than in successes.

This result demonstrates that negative instructions typically induce some suppression but are often too weak to override semantic pressure.
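The asymmetry is simple arithmetic on the two prompt conditions; the baseline and post-instruction probabilities below are chosen only to reproduce the reported means:

```python
# Suppression magnitude Delta P = P_0 - P_1 (baseline vs. negative-instruction
# prompt). The probability values are illustrative, not measured.
def suppression(p0, p1):
    return p0 - p1

dp_success = suppression(0.50, 0.272)  # ~0.228, typical compliant case
dp_failure = suppression(0.50, 0.448)  # ~0.052, typical violation case

print(round(dp_success / dp_failure, 1))  # 4.4 (the reported ratio)
```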

3. Mechanisms Underlying Constraint Failure

Negative constraint violations bifurcate into two mechanistic types:

| Failure Mode | Percentage | Mechanistic Signature |
| --- | --- | --- |
| Priming Failure | 87.5% | Mentioning X activates, rather than suppresses, X |
| Override Failure | 12.5% | Late-layer FFN contributions overpower suppression |

3.1 Priming Failure

Here, mention of the forbidden word X in "Do not say X" paradoxically primes the model to produce X. The Priming Index, defined as the difference between attention on the mentioned token X (TMF) and attention on the negation cue "do not" (NF), is typically positive (TMF ≈ 0.30, NF ≈ 0.11, PI = +0.19 in typical cases). In some cases, ΔP is even negative, indicating that the instruction actually boosts P(X).
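Computing the Priming Index is a one-line diagnostic; the attention values below are the typical figures quoted above:

```python
# Priming Index: PI = TMF - NF, where TMF is attention mass on the mentioned
# forbidden token and NF is attention mass on the negation cue ("do not").
def priming_index(attn_on_token, attn_on_negation):
    return attn_on_token - attn_on_negation

pi = priming_index(0.30, 0.11)
print(round(pi, 2))  # 0.19
print(pi > 0)        # True: attention favors X, flagging priming risk
```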

3.2 Override Failure

In override failures, the suppression signal produced by the instruction persists throughout the early and middle layers, but the late-stage feed-forward networks (FFNs), especially in layers 23–27, add large positive contributions (up to +0.39 logits for the FFN at layer ℓ = 27 in failure cases), effectively overriding the suppression and pushing P(X) above the emission threshold.

4. Causal Role of Late Transformer Layers

Activation patching experiments elucidate the causal influence of late layers. For high-pressure prompts (P_0 ≥ 0.8), baseline and negatively constrained runs are compared by patching in baseline activations at layer ℓ into the negative-instruction pass and monitoring the resulting change in P(X). Results show:

  • Layers 0–22: patching reduces P(X) (by up to ≈ 0.22); these layers carry suppression signals.
  • Layer 23: crossover; ΔP_patch ≈ 0.
  • Layers 24–27: patching increases P(X) (by up to +0.07); these baseline activations actually promote X.

The late layers, especially 23–27, are thus causally responsible for flipping suppression into override, confirming the architectural localization of override failures.
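The patching logic can be caricatured with an additive toy model in which each layer contributes a logit term to X. The per-layer contributions are invented to reproduce the reported signs; real patching swaps residual-stream activations inside the model rather than scalar logit terms:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# Hypothetical per-layer logit contributions to token X in each run.
baseline = {20: 0.5, 23: 0.3, 26: 0.4}  # unconstrained (baseline) run
negative = {20: 0.9, 23: 0.3, 26: 0.1}  # negative-instruction run

def patch_effect(layer):
    """Change in P(X) when the baseline contribution at `layer` is patched
    into the negative-instruction run (all other layers unchanged)."""
    z_neg = sum(negative.values())
    z_patched = z_neg - negative[layer] + baseline[layer]
    return sigmoid(z_patched) - sigmoid(z_neg)

print(patch_effect(20) < 0)  # early layer: patching reduces P(X)
print(patch_effect(26) > 0)  # late layer: patching increases P(X)
```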

5. Implications for Constraint Design in LLMs

The dynamics of semantic gravity wells have critical implications for model instruction design. Explicitly mentioning a forbidden token directly in a constraint often creates priming failures, as attention mechanisms are lured toward the named token. Mitigation strategies include:

  • Avoid direct mention: Phrase constraints at the category level (e.g., “Do not mention any city name”) or positively (e.g., “Use general geographic terms”).
  • Estimate semantic pressure P_0: Anticipate which prompts are difficult to constrain and use post-generation filtering when P_0 is high.
  • Monitor attention-based diagnostics, such as the Priming Index (TMF–NF), to flag conditions prone to priming failures in real time.
  • Target interventions at late-layer FFNs (layers 23–27) to dampen override behavior when constraints are active.
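A minimal post-generation filter along the lines of the second bullet might look like this; the forbidden word, threshold, and redaction string are hypothetical stand-ins, not values from the paper:

```python
import re

# When estimated semantic pressure P_0 is high, the constraint is fragile,
# so the output is checked and the forbidden token redacted post hoc.
FORBIDDEN = "Paris"       # hypothetical forbidden word
P0_THRESHOLD = 0.5        # hypothetical cutoff for "high pressure"

def filter_output(text, p0):
    """Redact the forbidden token when P_0 suggests the constraint may fail."""
    if p0 >= P0_THRESHOLD:
        return re.sub(re.escape(FORBIDDEN), "[redacted]", text,
                      flags=re.IGNORECASE)
    return text

print(filter_output("The capital of France is Paris.", p0=0.85))
# The capital of France is [redacted].
```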

In summary, negative instructions in LLMs do not fail due to outright neglect but due to a tug-of-war between high semantic pressure and insufficient suppression. When the former dominates—via priming or override mechanisms—the model is “drawn” into a semantic gravity well, culminating in characteristic constraint violations.

6. Summary Statistics, Equations, and Empirical Findings

Key empirical results characterizing the semantic gravity well phenomenon are as follows:

  • Sample size: 40,000 samples, 2,500 prompts.
  • Logistic regression parameters: β_0 = -2.40 (CI [-2.44, -2.35]), β_1 = 2.27 (CI [2.21, 2.33]).
  • Suppression magnitudes: ΔP_success = 0.228; ΔP_failure = 0.052; ratio ≈ 4.4×.
  • Failure mode prevalence: 87.5% priming, 12.5% override.
  • FFN layer-27 contribution: +0.39 logits (failure) vs. +0.10 (success).
  • Activation patching crossover at layer 23.

Empirical visualizations (see figures in Rana, 12 Jan 2026) include binned violation rates versus P_0, suppression magnitudes by outcome, logit lens progression across layers, attention vs. FFN contributions (layers 18–27), and patching effects dissected by layer.

The semantic gravity well thus encapsulates the model-internal dynamics by which negative constraints backfire—most notably when semantic pressure relentlessly draws output toward the forbidden token, outcompeting the intended suppression signal (Rana, 12 Jan 2026).
