Semantic Gravity Well in LLMs
- Semantic Gravity Well is a phenomenon where explicit negative constraints lead to increased probabilities of forbidden tokens due to strong inherent semantic pressure.
- Quantitative analyses using logistic regression and layer-wise logit lens reveal that high baseline probabilities and weakened suppression signals account for up to 78% of violation variance.
- The findings underscore the need to redesign constraints, such as avoiding direct token mentions and targeting late-layer FFNs, to mitigate priming and override failures.
A semantic gravity well is a phenomenon arising in LLMs when explicit negative constraints—such as “do not say X”—paradoxically increase the likelihood of producing the forbidden token X. This effect stems from an intrinsic tension between the model's learned statistical tendency (“semantic pressure”) to generate a highly probable completion and the externally imposed suppression from the negative instruction. The result is a quantifiable, often substantial, probability that the model will violate the constraint, especially for naturally likely completions. This phenomenon exhibits distinct, mechanistically traceable failure modes and has significant ramifications for the design of instruction-following LLMs (Rana, 12 Jan 2026).
1. Formalization of Semantic Pressure and Probability of Violation
Semantic pressure, denoted $P_{\mathrm{sem}}(X)$, quantifies the a priori probability mass the model assigns to a target token $X$ before any negative constraint is applied. For a word $X$, let $\mathcal{T}(X)$ denote the set of all valid token sequences decoding to $X$. Then,

$$P_{\mathrm{sem}}(X) = \sum_{t \in \mathcal{T}(X)} P(t \mid \text{prompt})$$

represents the model's propensity to output $X$ in free generation.
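The definition above can be sketched with a mock next-token distribution; the tokens, probabilities, and function names below are illustrative stand-ins, not values from the source:

```python
# Toy sketch of semantic pressure: P_sem(X) is the total probability
# mass over all token sequences that decode to the word X. The "model"
# here is a mock next-token distribution standing in for a real LLM.
MOCK_PROBS = {
    (): {"Paris": 0.30, "Par": 0.10, "Lyon": 0.05},  # P(t1 | prompt)
    ("Par",): {"is": 0.90, "k": 0.05},               # P(t2 | prompt, "Par")
}

def seq_prob(tokens):
    """Probability of a token sequence under the mock model (chain rule)."""
    p, prefix = 1.0, ()
    for tok in tokens:
        p *= MOCK_PROBS.get(prefix, {}).get(tok, 0.0)
        prefix += (tok,)
    return p

def semantic_pressure(tokenizations):
    """P_sem(X): sum over every tokenization that decodes to X."""
    return sum(seq_prob(t) for t in tokenizations)

# "Paris" might be a single token or split as "Par" + "is":
p_sem = semantic_pressure([("Paris",), ("Par", "is")])  # 0.30 + 0.10 * 0.90
```

With a real model, `MOCK_PROBS` would be replaced by actual next-token probabilities and `tokenizations` by the tokenizer's valid splits of $X$.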
Empirically, the probability of violating a negative constraint, i.e., generating the forbidden token despite the instruction, follows a tight logistic curve as a function of $P_{\mathrm{sem}}$:

$$P_{\mathrm{viol}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 P_{\mathrm{sem}})}}$$

Fitting on 40,000 samples yields estimates of the intercept $\beta_0$ and slope $\beta_1$ (each with a 95% confidence interval), with $R^2 = 0.78$. Thus, $P_{\mathrm{sem}}$ alone explains 78% of the variance in violation probability, indicating that high semantic pressure systematically undermines negative constraints.
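A minimal sketch of fitting such a logistic curve by maximum likelihood, using synthetic data with assumed ground-truth parameters (the paper's actual estimates are not reproduced here) and Newton's method:

```python
import numpy as np

# Fit P_viol = sigmoid(b0 + b1 * P_sem) on synthetic violation data.
# The ground-truth parameters below are illustrative placeholders.
rng = np.random.default_rng(0)
b0_true, b1_true = -4.0, 8.0
p_sem = rng.uniform(0.0, 1.0, 40_000)             # semantic pressure per sample
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
viol = rng.random(40_000) < sigmoid(b0_true + b1_true * p_sem)

X = np.column_stack([np.ones_like(p_sem), p_sem])  # design matrix [1, P_sem]
b = np.zeros(2)                                    # [b0, b1]
for _ in range(25):                                # Newton's method (IRLS)
    p = sigmoid(X @ b)
    grad = X.T @ (viol - p)                        # log-likelihood gradient
    H = X.T @ (X * (p * (1 - p))[:, None])         # observed information
    b += np.linalg.solve(H, grad)
# b is now close to (b0_true, b1_true)
```

Bootstrapping this fit over resampled data would yield the confidence intervals reported in the source.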
2. Layer-wise Analysis and Suppression Asymmetry
Transformer models process information in layers, and their compliance with or violation of negative instructions can be dissected layer by layer. The logit lens technique computes, for each layer $\ell$, the intermediate next-token distribution

$$p^{(\ell)}(X) = \mathrm{softmax}\!\left(W_U h^{(\ell)}\right)_X$$

at the decoding step of interest, where $h^{(\ell)}$ is the residual-stream state after layer $\ell$ and $W_U$ is the unembedding matrix.
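The logit lens readout can be sketched with random stand-in weights; the dimensions, seed, and token index below are illustrative:

```python
import numpy as np

# Logit lens: read out an intermediate next-token distribution at every
# layer by projecting the residual-stream state h^(l) through the
# unembedding matrix W_U. All weights here are random stand-ins for a
# real transformer's activations.
rng = np.random.default_rng(1)
d_model, vocab, n_layers = 16, 50, 28
X_ID = 7                                          # vocabulary index of token X

W_U = rng.normal(size=(vocab, d_model))           # unembedding matrix
hidden = rng.normal(size=(n_layers, d_model))     # h^(l), one state per layer

def lens_distribution(h):
    logits = W_U @ h
    z = np.exp(logits - logits.max())             # numerically stable softmax
    return z / z.sum()

# p^(l)(X) across layers: how strongly each layer "commits" to X.
trajectory = [lens_distribution(hidden[l])[X_ID] for l in range(n_layers)]
```

On a real model, `hidden` would come from the forward pass at the decoding step of interest, and the trajectory would show where suppression takes hold or collapses.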
Let $p_{\mathrm{base}}(X)$ and $p_{\mathrm{neg}}(X)$ denote the probabilities of $X$ under the baseline and negative-instruction prompts, respectively. The suppression magnitude is defined as $\Delta_{\mathrm{sup}} = p_{\mathrm{base}}(X) - p_{\mathrm{neg}}(X)$. Outcomes reveal a stark quantitative asymmetry:
- Successful constraint compliance: large positive $\Delta_{\mathrm{sup}}$.
- Failure cases: much smaller $\Delta_{\mathrm{sup}}$.
- The suppression signal in failures is thus substantially weaker than in successes.
This result demonstrates that negative instructions typically induce some suppression but are often too weak to override semantic pressure.
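The suppression magnitude itself is a simple difference of probabilities; a sketch with illustrative numbers mirroring the reported asymmetry:

```python
# Delta_sup = p_base(X) - p_neg(X). The probabilities are illustrative:
# compliance shows a strong drop under the negative instruction, while
# failure shows only a weak one.
def suppression_magnitude(p_base, p_neg):
    return p_base - p_neg

delta_compliant = suppression_magnitude(p_base=0.60, p_neg=0.05)  # strong suppression
delta_failing = suppression_magnitude(p_base=0.60, p_neg=0.45)    # weak suppression
assert delta_compliant > delta_failing > 0
```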
3. Mechanisms Underlying Constraint Failure
Negative constraint violations bifurcate into two mechanistic types:
| Failure Mode | Percentage | Mechanistic Signature |
|---|---|---|
| Priming Failure | 87.5% | Mentioning X activates, not suppresses, X |
| Override Failure | 12.5% | Late-layer FFN contributions overpower suppression |
3.1 Priming Failure
Here, mention of the forbidden word $X$ in "Do not say X" paradoxically primes the model to produce $X$. The Priming Index, defined as the difference between attention on the mentioned $X$ and attention on the negation cue ("do not"), is typically positive in these cases. In some cases, $\Delta_{\mathrm{sup}}$ is even negative, indicating that the instruction actually boosts $p(X)$ above its baseline value.
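A sketch of the Priming Index diagnostic on a toy attention distribution (the token positions and attention weights are illustrative):

```python
import numpy as np

# Priming Index: attention mass on the mentioned forbidden token X minus
# attention mass on the negation cue ("do not"). A positive value means
# the model attends more to X than to the negation, the signature of a
# priming failure. The prompt and weights below are illustrative.
prompt = ["Do", "not", "say", "Paris", ".", "The", "capital", "is"]
attn = np.array([0.02, 0.03, 0.05, 0.55, 0.05, 0.05, 0.15, 0.10])
assert np.isclose(attn.sum(), 1.0)   # attention over the prompt sums to 1

NEGATION, FORBIDDEN = [0, 1], [3]    # positions of "Do not" and "Paris"
priming_index = attn[FORBIDDEN].sum() - attn[NEGATION].sum()
# priming_index > 0: attention is drawn to the forbidden token itself
```

In practice, `attn` would be an attention row (or head-averaged rows) extracted from the model at the decoding step.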
3.2 Override Failure
In override failures, the suppression signal produced by the instruction persists throughout the early and middle layers, but the late-stage feed-forward networks (FFNs), especially in layers 23–27, add large positive logit contributions (peaking at the layer-27 FFN in failure cases), effectively overriding the suppression and pushing $p(X)$ above the emission threshold.
4. Causal Role of Late Transformer Layers
Activation patching experiments elucidate the causal influence of late layers. For high-pressure prompts (large $P_{\mathrm{sem}}$), baseline and negatively constrained runs are compared by patching baseline activations at layer $\ell$ into the negative-instruction pass and monitoring the resulting change in $p(X)$. Results show:
- Layers 0–22: patching reduces $p(X)$; these layers carry suppression signals.
- Layer 23: crossover, where the patching effect changes sign.
- Layers 24–27: patching increases $p(X)$; these baseline activations actually promote $X$.
The late layers, especially 23–27, are thus causally responsible for flipping suppression into override, confirming the architectural localization of override failures.
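The patching logic can be sketched by modeling the final logit of $X$ as a sum of per-layer contributions; the contribution values below are toy numbers chosen only to mirror the reported sign pattern:

```python
import numpy as np

# Activation patching sketch: swap the baseline run's layer-l
# contribution into the negative-instruction run and measure the change
# in p(X). The toy contributions reproduce the reported signs: patches
# at layers 0-22 reduce p(X), layer 23 is the crossover, and patches at
# layers 24-27 increase p(X).
n_layers = 28
base = np.full(n_layers, 0.2)        # baseline per-layer logit contributions
neg = base.copy()
neg[:23] += 0.3                      # negative run differs early...
neg[24:] -= 0.4                      # ...and late, in opposite directions

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
p_x = lambda contribs: sigmoid(contribs.sum() - 8.0)   # toy readout of p(X)

effects = []
for layer in range(n_layers):
    patched = neg.copy()
    patched[layer] = base[layer]     # patch baseline activation into neg run
    effects.append(p_x(patched) - p_x(neg))

# effects[l] < 0 for early layers, ~0 at layer 23, > 0 for layers 24-27
```

A real experiment would patch cached residual-stream activations rather than scalar contributions, but the sign-of-effect logic is the same.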
5. Implications for Constraint Design in LLMs
The dynamics of semantic gravity wells have critical implications for model instruction design. Explicitly mentioning a forbidden token directly in a constraint often creates priming failures, as attention mechanisms are lured toward the named token. Mitigation strategies include:
- Avoid direct mention: Phrase constraints at the category level (e.g., “Do not mention any city name”) or positively (e.g., “Use general geographic terms”).
- Estimate semantic pressure $P_{\mathrm{sem}}(X)$: anticipate which prompts are difficult to constrain, and use post-generation filtering when $P_{\mathrm{sem}}$ is high.
- Monitor attention-based diagnostics, such as the Priming Index (TMF–NF), to flag conditions prone to priming failures in real time.
- Target interventions at late-layer FFNs (layers 23–27) to dampen override behavior when constraints are active.
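Two of these mitigations, a category-level forbidden list plus post-generation filtering, can be sketched as follows; the word list, helper names, and retry policy are hypothetical:

```python
# Mitigation sketch: instead of "do not say X" (which names and thereby
# primes X), constrain at the category level and add a post-generation
# filter that rejects outputs containing any word from the forbidden
# category, retrying generation a bounded number of times.
FORBIDDEN_CITIES = {"paris", "london", "tokyo"}  # illustrative category

def violates(text, forbidden=FORBIDDEN_CITIES):
    """True if any word in the output falls in the forbidden category."""
    return any(w.strip(".,!?").lower() in forbidden for w in text.split())

def filtered_generate(generate, prompt, max_tries=3):
    """Retry generation until the output passes the category filter."""
    for _ in range(max_tries):
        out = generate(prompt)
        if not violates(out):
            return out
    return "[withheld: constraint could not be satisfied]"

# Usage with a stand-in generator that fails once, then complies:
outputs = iter(["The capital is Paris.", "The capital is a major European city."])
result = filtered_generate(lambda p: next(outputs), "Name the capital of France")
# result == "The capital is a major European city."
```

This treats the constraint as a rejection criterion outside the model, sidestepping the priming induced by naming the token in the prompt.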
In summary, negative instructions in LLMs do not fail due to outright neglect but due to a tug-of-war between high semantic pressure and insufficient suppression. When the former dominates—via priming or override mechanisms—the model is “drawn” into a semantic gravity well, culminating in characteristic constraint violations.
6. Summary Statistics, Equations, and Empirical Findings
Key empirical results characterizing the semantic gravity well phenomenon are as follows:
- Sample size: 40,000 samples, 2,500 prompts.
- Logistic regression fit: intercept $\beta_0$ and slope $\beta_1$, with $R^2 = 0.78$; semantic pressure alone explains 78% of violation variance.
- Suppression magnitudes: substantially larger in compliance than in failure cases.
- Failure mode prevalence: 87.5% priming, 12.5% override.
- FFN layer 27 contribution: large positive logit contribution in failures, far exceeding that in successes.
- Activation patching crossover at layer 23.
Empirical visualizations (see figures in (Rana, 12 Jan 2026)) include binned violation rates versus $P_{\mathrm{sem}}$, suppression magnitudes by outcome, logit lens progression across layers, attention vs. FFN contributions (layers 18–27), and patching effects dissected by layer.
The semantic gravity well thus encapsulates the model-internal dynamics by which negative constraints backfire—most notably when semantic pressure relentlessly draws output toward the forbidden token, outcompeting the intended suppression signal (Rana, 12 Jan 2026).