Semantic Gravity Wells: Why Negative Constraints Backfire

Published 12 Jan 2026 in cs.AI and cs.CL | (2601.08070v1)

Abstract: Negative constraints (instructions of the form "do not use word X") represent a fundamental test of instruction-following capability in LLMs. Despite their apparent simplicity, these constraints fail with striking regularity, and the conditions governing failure have remained poorly understood. This paper presents the first comprehensive mechanistic investigation of negative instruction failure. We introduce semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token, and demonstrate that violation probability follows a tight logistic relationship with pressure ($p=\sigma(-2.40+2.27\cdot P_0)$; $n=40{,}000$ samples; bootstrap $95\%$ CI for slope: $[2.21, 2.33]$). Through layer-wise analysis using the logit lens technique, we establish that the suppression signal induced by negative instructions is present but systematically weaker in failures: the instruction reduces target probability by only 5.2 percentage points in failures versus 22.8 points in successes -- a $4.4\times$ asymmetry. We trace this asymmetry to two mechanistically distinct failure modes. In priming failure (87.5% of violations), the instruction's explicit mention of the forbidden word paradoxically activates rather than suppresses the target representation. In override failure (12.5%), late-layer feed-forward networks generate contributions of $+0.39$ toward the target probability -- nearly $4\times$ larger than in successes -- overwhelming earlier suppression signals. Activation patching confirms that layers 23--27 are causally responsible: replacing these layers' activations flips the sign of constraint effects. These findings reveal a fundamental tension in negative constraint design: the very act of naming a forbidden word primes the model to produce it.


Summary

The paper demonstrates that negative constraints fail because of inherent semantic pressure, quantitatively linked via a logistic model that explains 78% of the variance.
It reveals that suppression is 4.4× weaker in failures, where attention focuses more on the forbidden token and late-layer FFN contributions override negation signals.
Activation patching confirms that late transformer layers (23–27) causally drive surges in target probability, effectively bypassing negative instruction cues.

Mechanistic Analysis of Negative Constraint Failure in LLMs

Introduction

The paper "Semantic Gravity Wells: Why Negative Constraints Backfire" (2601.08070) delivers a rigorous mechanistic exploration into why LLMs regularly fail to comply with negative constraints—i.e., instructions of the form “do not use word X.” By bridging systematic behavioral characterization with deep interpretability, the study traces failure not to generic comprehension deficits but to competitive computational forces within transformer networks. The analysis centers on Qwen2.5-7B-Instruct, leveraging open weights for granular investigation, but the findings have implications for broader model classes and practical deployment.

Semantic Pressure: Quantifying Predictable Failure

Central to the paper is the formalization of semantic pressure, $P_0$: the baseline probability that the model generates the forbidden token $X$ in response to a prompt, absent any constraints. The observed violation rate under negative instruction correlates tightly with $P_0$ via a logistic relationship, described by:

$p(\text{violation}) = \sigma(-2.40 + 2.27 \cdot P_0)$

This fit explains 78% of variance in violation rates (Figure 1).

Figure 1: Violation rate increases monotonically with semantic pressure; logistic fit explains 78% of variance ($R^2 = 0.78$), with failure rate exceeding 46% at $P_0 = 0.9$.

This result substantiates that failures are lawful, not random: baseline pressure largely dictates when negative instructions will fail, providing an actionable predictive criterion for model deployment.
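As a rough illustration of how the reported fit could be used as a predictive criterion, the sketch below evaluates the logistic curve at a few pressure values. The coefficients are the ones quoted above for Qwen2.5-7B-Instruct; the function name `violation_probability` is a hypothetical helper, and the coefficients should not be assumed to transfer to other models.

```python
import math

# Coefficients quoted in the text for Qwen2.5-7B-Instruct (model-specific).
INTERCEPT = -2.40
SLOPE = 2.27

def violation_probability(p0: float) -> float:
    """Predicted violation probability at baseline pressure p0 (hypothetical helper)."""
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be a probability in [0, 1]")
    # Standard logistic function applied to the linear predictor.
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + SLOPE * p0)))

for p0 in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"P0={p0:.2f} -> predicted violation rate {violation_probability(p0):.3f}")
```

The curve is strictly increasing in $P_0$ because the slope is positive, which matches the monotonic trend described for Figure 1.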

Asymmetric Suppression and Distinct Failure Modes

The negative instruction suppresses the target probability in both outcomes, but failure cases show a marked asymmetry: suppression is $4.4\times$ weaker in failures ($\Delta P = 0.052$) than in successes ($\Delta P = 0.228$).

Figure 2: Mean suppression of target probability by negative instruction is much weaker in failures (5.2 points) than in successes (22.8 points), a $4.4\times$ difference.
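
The quoted asymmetry follows directly from the two mean suppression values. As a quick check, with the subscripts added here only as labels for the two groups:

$\frac{\Delta P_{\text{success}}}{\Delta P_{\text{failure}}} = \frac{0.228}{0.052} \approx 4.38 \approx 4.4$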

Attention analysis reveals that failure cases disproportionately route focus toward the instruction's mention of $X$, rather than the negation cue "do not," signifying a priming effect. At higher semantic pressure, this attentional misalignment intensifies, with failures consistently showing elevated focus on the forbidden word.

Figure 3: Failure cases attend more to the target mention and less to the negation cue, indicating the forbidden word acts as a prime rather than a constraint.
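
The attention comparison can be summarized as a single number: attention mass on the target mention minus attention mass on the negation cue. The sketch below computes such an index from one attention distribution. The tokens, weights, and the function name `priming_index` are illustrative assumptions, not values or code from the paper, and how attention is aggregated across heads and layers is left open.

```python
def priming_index(attention, target_positions, negation_positions):
    """Attention mass on the target mention minus mass on the negation cue.

    attention: weights from the decision position over prompt tokens.
    Positive values mean the model attends more to the forbidden word.
    """
    target_mass = sum(attention[i] for i in target_positions)
    negation_mass = sum(attention[i] for i in negation_positions)
    return target_mass - negation_mass

# Hypothetical prompt tokens and attention weights (weights sum to 1).
tokens = ["Do", "not", "say", "Paris", ".", "The", "capital", "of", "France", "is"]
attention = [0.05, 0.10, 0.05, 0.30, 0.02, 0.03, 0.15, 0.05, 0.20, 0.05]

pi = priming_index(attention, target_positions=[3], negation_positions=[0, 1])
print(f"priming index = {pi:+.2f}")
```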

Layerwise Dynamics: Activation and Override

Applying the logit lens across transformer layers uncovers three computational regimes:

Early layers ($0\text{--}20$): negligible probability assigned to the target.
Critical late layers ($21\text{--}27$): probability of the forbidden word explosively diverges in failure cases.
Final layer: baseline and failure outputs approach a target probability of 0.71, while successful negative-instruction runs remain suppressed at 0.08.

Figure 4: Target probability surges in late layers for failures, suggesting decision-making is deferred and concentrated in upper transformer blocks.
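
The logit lens idea can be sketched on a toy model: project each layer's residual state through the unembedding matrix and read off the target token's probability. Everything below (the 3-token vocabulary, the 2-dimensional states, the function names) is a made-up illustration. A real logit lens would also apply the model's final normalization before unembedding, which this sketch omits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_states, unembed, target_id):
    """Return the target token's probability when each layer's state is decoded early."""
    probs = []
    for h in hidden_states:
        # One logit per vocabulary row: dot product of the row with the state.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs.append(softmax(logits)[target_id])
    return probs

# Toy setup (assumed): one unembedding row per vocab token, one state per layer.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[0.0, 0.0], [0.2, 0.1], [0.5, 0.1], [3.0, 0.0]]

trajectory = logit_lens(hidden_states, unembed, target_id=0)
for layer, p in enumerate(trajectory):
    print(f"layer {layer}: P(target) = {p:.3f}")
```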

Decomposition of self-attention and FFN reveals that:

Attention suppresses target probability, but this suppression is overwhelmed by FFN contributions in late layers.
In failures, the FFN push at layer 27 is $+0.39$ (vs. $+0.10$ in successes), and attention's suppression is not sufficient to offset it.

Figure 5: Late-layer FFNs consistently promote target generation, with their output nearly $4\times$ stronger in failures, often overpowering attention-driven suppression.
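
One way to picture the decomposition is to decode the residual stream before a layer, after its attention write, and after its FFN write, then attribute the change in target probability to each component. The sketch below does this on toy numbers. The sequential-difference attribution and all values are assumptions for illustration; the paper's exact attribution method is not reproduced here.

```python
import math

def target_prob(hidden, unembed, target_id):
    """Probability of the target token when `hidden` is decoded directly."""
    logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return exps[target_id] / sum(exps)

def decompose_layer(resid_in, attn_out, ffn_out, unembed, target_id):
    """Attribute the layer's change in target probability to attention and FFN."""
    after_attn = [r + a for r, a in zip(resid_in, attn_out)]
    after_ffn = [r + f for r, f in zip(after_attn, ffn_out)]
    p_in = target_prob(resid_in, unembed, target_id)
    p_attn = target_prob(after_attn, unembed, target_id)
    p_ffn = target_prob(after_ffn, unembed, target_id)
    return {"attention": p_attn - p_in, "ffn": p_ffn - p_attn}

# Toy values (assumed): attention writes against the target, the FFN writes toward it.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
contrib = decompose_layer(
    resid_in=[1.0, 0.0], attn_out=[-0.5, 0.0], ffn_out=[2.0, 0.0],
    unembed=unembed, target_id=0,
)
print(contrib)
```

In this toy case the FFN contribution is positive and larger in magnitude than the attention contribution is negative, which is the override pattern described above.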

Causal Validation via Activation Patching

Patching activations from baseline (unconstrained) conditions into negative instruction runs enables causal attribution. In layers 23–27, patched activations increase target probability, indicating that these layers are not merely correlated with constraint failure but causally involved in it.

Figure 6: Activation patching pinpoints causal responsibility to layers 23–27; in these layers, baseline activations drive up target probability, overriding negative constraints.
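
The patching procedure can be sketched with a toy residual model: cache each layer's output on the baseline run, then rerun the constrained input while swapping in one cached output at a time and measure how the final value moves. The scalar "target logit" state, the three layer functions, and the inputs are all invented for illustration and are not the paper's setup.

```python
def run_model(x, layers, patch=None, cache=None):
    """Run a toy residual stack over a scalar state.

    patch: optional {layer_index: output} replacing that layer's write.
    cache: optional dict that records each layer's write.
    """
    patch = patch or {}
    h = x
    for i, layer in enumerate(layers):
        out = patch[i] if i in patch else layer(h)
        if cache is not None:
            cache[i] = out
        h = h + out  # residual update
    return h

# Assumed toy layers: two small linear writes, then a late layer that
# amplifies the target only when the running evidence is positive.
layers = [lambda h: 0.1 * h, lambda h: 0.1 * h, lambda h: max(h, 0.0)]

baseline_cache = {}
baseline_out = run_model(1.0, layers, cache=baseline_cache)   # unconstrained run
constrained_out = run_model(-1.0, layers)                      # constrained run

# Patch one layer at a time and record the shift in the final target logit.
effects = {}
for i in range(len(layers)):
    patched_out = run_model(-1.0, layers, patch={i: baseline_cache[i]})
    effects[i] = patched_out - constrained_out
print(effects)
```

Here the last layer's patch produces the largest shift, which is the kind of layer-localized effect that patching is meant to expose.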

Taxonomy and Practical Implications

Two failure modes emerge:

Priming failure (87.5%): Negatively constrained instructions mentioning the forbidden token prime its selection; the model attends more to the target mention than the negation.
Override failure (12.5%): Strong FFN contributions in late layers overwhelm valid suppression signals, even when attention appropriately focuses on the negation.

Figure 7: Failure taxonomy. Priming is most frequent, arising from explicit mention of the target; override is characterized by FFN dominance in late layers.

For practitioners, the dominance of priming implies avoiding explicit mention of the forbidden word in instructional design. Alternative formulations—categorical constraints, indirect references, positive instructions—should substantially reduce expected violation rates. For safety-critical or high-pressure scenarios, generation-time constraints are insufficient, and post-hoc output filtering is warranted.
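
A post-hoc filter of the kind recommended above can be as simple as a case-insensitive, word-boundary-aware match on the output. The sketch below is a minimal version under that assumption; `contains_forbidden` and `filter_output` are hypothetical names, and a production detector would need to handle more variants (hyphenation, diacritics, multi-word targets).

```python
import re

def contains_forbidden(text: str, forbidden: str) -> bool:
    """True if `forbidden` appears as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(forbidden) + r"\b", re.IGNORECASE)
    return pattern.search(text) is not None

def filter_output(text: str, forbidden: str, fallback: str = "[withheld]") -> str:
    """Return the text unchanged if compliant, otherwise a fallback string."""
    return fallback if contains_forbidden(text, forbidden) else text

print(filter_output("The answer is paris.", "Paris"))
print(filter_output("A Parisian cafe.", "Paris"))
```

The word-boundary anchors keep "Parisian" from counting as a violation of a ban on "Paris"; whether that is the desired behavior depends on the application.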

Theoretical and Future Directions

This work demonstrates the explanatory power of mechanistic interpretability techniques: logit lens, component decomposition, and activation patching directly resolve how behavioral failures map to underlying computations. The observed competitive dynamic between semantic and constraint pressure suggests further inquiry into how transformer architectures encode and resolve high-conflict instruction scenarios.

Future research should test the generality of these mechanisms across model sizes and architectures, refine constraints that minimize priming, and develop fine-grained targeted interventions (e.g., at the attention head or FFN neuron level). The findings are immediately relevant to instruction tuning, safety mechanisms, and the understanding of negation processing in LLMs.

Conclusion

Negative constraints in LLMs fail not randomly but due to an interlayer competition between entrenched statistical expectations (semantic pressure) and instruction-induced suppression. Explicitly referencing the forbidden token primes its activation, while late-stage FFN layers in the transformer overpower early-layer suppression signals. Understanding these dynamics enables diagnostic and prescriptive improvement both for prompt engineers and model developers, supporting robust instruction-following in deployed AI systems.


Explain it Like I'm 14

Overview

This paper asks a simple but puzzling question: why do AI language models often do the one thing you ask them not to do? For example, if you say “Answer in one word, but do not say ‘Paris,’” the model still often replies “Paris.” The authors dig inside a real model’s inner workings to figure out why this happens and when it’s most likely. Their main idea is that there’s a tug-of-war inside the model: a strong pull toward the most obvious answer (like a gravity well) versus the push from the “don’t say it” warning. When the pull is stronger, the warning often fails.

Key Questions

Here are the main questions the paper tries to answer:

When you tell a model “don’t say X,” how often does it still say X, and can we predict those failures?
What inside the model causes these failures—does it ignore the warning, misunderstand “not,” or something else?
Are there different kinds of failures, and can we tell them apart?
What parts of the model’s “thinking steps” are responsible?

How They Studied It (Methods)

The researchers chose a widely used open model (Qwen2.5-7B-Instruct) so they could look inside its layers. They built a big set of short questions where the most natural answer is one word (like “The capital of France is ____”). For each question, they tested two versions: a normal one and a “negative constraint” version that said “Do not use the word [X]”.

To keep ideas simple, here’s what they measured and how, using everyday analogies:

Measuring “semantic pressure”: Think of this as how strongly the model “wants” to say a specific word before any warning. For example, in “The capital of France is ____,” the pull toward “Paris” is huge. The authors turned this into a number by checking how likely the model was to say the target word without any warning. That number is the pressure.
Checking suppression: After adding the warning (“don’t say X”), they checked how much the model’s chance of saying X went down. If it goes down a lot, the warning is strong; if only a little, it’s weak.
Peeking at layers with a “logit lens”: An LLM thinks in steps (layers). The logit lens is like peeking at half-baked thoughts after each step to see how likely the model is to choose the forbidden word at that point.
Attention analysis: Attention is where the model “looks” in the prompt. They checked whether the model looks more at “do not” or at the word it’s not supposed to say. If the model stares at the forbidden word, that can backfire (like reverse psychology).
Component breakdown: Each layer has two main parts: attention (routes information) and a feed-forward network (FFN), which acts like a small calculator that boosts some word choices. They measured which part pushes for or against the forbidden word.
Activation patching: Imagine copy-pasting the model’s internal state from one run into another. If replacing certain layers’ “thoughts” changes the outcome, those layers are causing the behavior. This shows cause, not just correlation.

They ran this for thousands of cases and looked at what patterns show up again and again.
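
The first two measurements in the list above can be sketched in a few lines: pressure is the chance of the forbidden word with no warning, and suppression is how much that chance drops once the warning is added. The probabilities below are made up, and summing over several spellings of the word (with and without a leading space, different casing) is an assumption about how a tokenizer-aware measurement might work.

```python
def semantic_pressure(next_token_probs, variants):
    """Total probability the model puts on any spelling of the forbidden word."""
    return sum(next_token_probs.get(tok, 0.0) for tok in variants)

def suppression(p_before, p_after):
    """How far the warning pushed the forbidden word's probability down."""
    return p_before - p_after

# Hypothetical next-token distributions for "The capital of France is ____".
variants = (" Paris", "Paris", " paris")
no_warning = {" Paris": 0.62, "Paris": 0.05, " paris": 0.01, " Lyon": 0.04, " the": 0.28}
with_warning = {" Paris": 0.40, "Paris": 0.04, " paris": 0.01, " Lyon": 0.20, " the": 0.35}

p0 = semantic_pressure(no_warning, variants)
p1 = semantic_pressure(with_warning, variants)
print(f"pressure P0 = {p0:.2f}, after warning = {p1:.2f}, suppression = {suppression(p0, p1):.2f}")
```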

Main Findings

Below is a short list of the key results, introduced in plain language:

A simple rule predicts failure: The higher the semantic pressure (how strongly the model wants to say the word), the more likely it is to break the rule. This follows a neat S-shaped pattern: when the pressure is low, failures are rare; when it’s high, failures rise a lot (approaching about half of cases at the highest pressures). In other words, if “Paris” is the obvious answer, warnings often won’t stop the model.
The warning works, but not enough in failures: The “don’t say it” instruction does lower the chance of saying the word in both successes and failures. But in failures, the push-away effect is about 4 times weaker than in successes. So the model hears the warning—it just can’t overcome the pull.
Most failures are “priming” failures (about 88%): Mentioning the forbidden word (like “don’t say Paris”) actually makes the model focus on that word. Attention shows the model looks more at “Paris” than at “do not.” It’s like telling someone “don’t think about pink elephants”—you just made them think about pink elephants.
The rest are “override” failures (about 12%): Here, the warning does push the model away from the word at first, but late in its thinking process, certain parts (the FFNs in later layers) give a strong last-minute shove toward the forbidden word and overpower the warning.
The trouble lives in the late layers: Looking across the model’s steps, the forbidden word’s probability stays low early on, then shoots up in the last few steps. The FFN components in layers near the end (layers 23–27) add big boosts toward the forbidden word, especially in failures.
Direct cause confirmed by activation patching: Swapping in the late-layer “thoughts” from an unconstrained run makes the constrained run more likely to break the rule. This shows those late layers are not just related—they help cause the violations.

A Simple Picture of What’s Going On

There’s a gravity-like pull toward the obvious answer. That’s the semantic pressure.
The warning is a push in the other direction.
Naming the forbidden word creates a strong mental cue—so strong that it often cancels out the “do not” part.
Even when the warning starts to work, late steps in the model’s process can override it, pushing the forbidden word back to the top.

The Two Failure Types

1) Priming Failure (most common)

What it is: Saying “don’t say X” puts X in the model’s spotlight. The model fixates on X and ends up saying it.
Why it happens: Attention focuses more on the word than on the “do not.” The warning backfires like reverse psychology.

2) Override Failure (less common)

What it is: The warning reduces the chance of the forbidden word at first, but in the final steps, the model’s internal boosters (FFNs) push the forbidden word back up and overrule the warning.
Why it happens: Late-layer “boosters” are stronger than the earlier “brakes.”

Why This Matters

For users and developers: Don’t name the forbidden word if you can avoid it. For example, say “Answer without city names” instead of “Don’t say Paris.” Or reword the task so the model has other safe paths to follow.
For tougher cases: When the obvious answer is exactly what you’re banning, the model is more likely to fail. In those cases, add extra safety checks after generation (like filters) instead of relying only on the warning.
For model understanding: This study shows how to move from “the model messed up” to “here’s the internal reason why.” It connects behavior (breaking the rule) to mechanisms (attention focus, FFN boosts, late-layer overrides).

Limitations and Scope

The study focuses on one strong open model and single-word answers in controlled settings. Bigger or different models, or longer and messier tasks, may behave differently.
Still, the core idea—naming a forbidden word often primes it, and late-layer boosts can override warnings—likely shows up in many models.

Takeaway

Negative instructions can backfire because they highlight the very word you want to avoid. When the model already “feels” a strong pull toward that word, a simple “don’t” often isn’t enough—especially since the model’s later thinking steps can boost that word back to the top. The fix is to design prompts that don’t prime the forbidden word, and to use extra checks when the obvious answer is exactly what must be avoided.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed so future researchers can act on each item.

Cross-model generalization: Verify whether the pressure–violation logistic relationship and the two failure modes (priming vs. override) hold across model scales (e.g., 7B, 13B, 70B+), training regimes (RLHF variants, Constitutional AI), and architectures (Mixture-of-Experts, state-space models).
Layer localization generality: Test if the late-layer locus of override (layers 23–27 in Qwen2.5-7B-Instruct) maps to analogous layer ranges in other models or is model-specific.
Head-/neuron-level causality: Identify specific attention heads and FFN neurons responsible for priming and override via fine-grained patching/knockout (per-head, per-MLP neuron), causal scrubbing, and circuit tracing.
Multi-token targets: Extend analysis from single-token targets to multi-token words and phrases (e.g., “New York”), measuring how sequential commitment affects priming and override dynamics.
Category-level and semantic constraints: Evaluate negative constraints that forbid categories (e.g., “do not mention any cities”) or semantic sets, including how models represent and comply with set membership without explicitly naming targets.
Alternative phrasing effectiveness: Systematically compare constraint formulations that avoid target naming (e.g., indirect reference, positive reframing, rhyming hints) on compliance, utility, and task success, across pressure regimes.
Decoding interventions vs. instructions: Quantify how generation-time controls (logit bias/token bans, constrained decoding, grammar-guided decoding, beam vs. nucleus sampling) interact with or supersede negative instructions, especially under high semantic pressure.
Post-generation filtering efficacy: Benchmark output filtering (regex/semantic filters, classifier gates) against instruction-based constraints on compliance, false positives/negatives, and latency for high-pressure cases.
Pressure metric robustness: Assess how the semantic pressure $P_0$ varies across tokenizers, languages, casing/whitespace/punctuation conventions, and whether alternative pressure definitions (e.g., log-prob mass, normalized entropy) yield tighter or more general predictive fits.
Dataset selection biases: Examine whether pressure gating and prompt construction (e.g., unique best answers, OOD prompts) introduce confounds that inflate the logistic fit; replicate on more naturalistic, noisy datasets.
Long-form and interactive contexts: Move beyond one-word outputs to multi-sentence responses and multi-turn dialogues to test if priming and override persist, amplify, or attenuate with longer contexts and memory.
Role and position of constraints: Measure how constraint location (system vs. user vs. assistant message), ordering (before/after the question), formatting (markdown/emphasis), and distance from the decision token affect compliance.
Cross-lingual behavior: Evaluate negative constraints in non-English languages and multilingual models, including tokenization effects (e.g., character vs. subword) on priming and detection.
Ambiguity and polysemy: Investigate cases where the forbidden token is a substring of other words, has multiple senses, or shares morphology with allowed tokens; refine detection to handle lemmatization and word-boundary edge cases.
Attention causality: Move beyond attention-weight correlations by performing targeted interventions (e.g., attention head ablations, attention redirection) to establish whether negation-related heads are necessary/sufficient for suppression.
FFN mechanism detail: Determine whether late-layer FFNs implement direct target-concept promotion, residual bypass, or interaction effects with attention; use techniques like ROME/MEMIT or neuron-level concept probing to edit or dampen target-specific activations.
Training-time remedies: Test whether fine-tuning with curriculum-based negative-constraint data, adversarial examples, or regularizers (e.g., penalizing target activation when negation is present) reduces priming/override without degrading task utility.
Safety-oriented constraints: Apply the analysis to safety-relevant forbidden content (e.g., self-harm methods, PII) to quantify whether the same mechanisms drive constraint failure and how to mitigate them in deployment.
Real-time diagnostics: Develop and evaluate runtime indicators (e.g., Priming Index thresholds, layer-wise logit lens monitors) that predict impending constraint violation early enough to steer decoding away from the forbidden token.
Temporal dynamics of priming: Test whether the timing/order of target mention versus negation (e.g., “do not say [X]” vs. “[X] is forbidden; do not say it”) modulates priming strength, and whether delayed mention or masked reference reduces activation.
Confounders in logistic fit: Probe whether category differences, prompt naturalness, or baseline entropy drive the observed slope; fit multi-variable models to rule out spurious correlations and estimate causal contributions.
Impact of sampling hyperparameters: Systematically vary temperature, top-p/top-k, and repetition penalties to map how decoding stochasticity alters violation rates under matched pressure.
Compliance definition and detection: Strengthen violation detection for edge cases (compound words, hyphenation, capitalization variants, punctuation, diacritics), and report precision/recall of the detector against human annotations.
Interaction with helpfulness priors: Test whether RLHF-style helpfulness incentives implicitly bias models to provide the “most useful” answer (the forbidden one) despite negation, and whether reward shaping can realign priorities.
Memory/KV-cache effects: Investigate whether the inclusion of the forbidden token in the prompt (and its representation in the KV cache) directly facilitates later emission; explore cache editing or gating strategies to mitigate priming.
Theoretical modeling of competition: Develop a formal model of semantic vs. constraint pressure (e.g., energy-based or probabilistic graphical interpretations) that predicts when suppression can overcome activation and guides principled intervention design.
Scalability of patching: Evaluate the practicality, latency, and reliability of activation patching or steering in real-time systems, including robustness across inputs and resistance to distribution shift.
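
Among the decoding interventions listed above, a hard token ban is the simplest to sketch: set the logits of every banned token id to negative infinity before the softmax so those tokens receive zero probability. The vocabulary and logits below are toy values, and the function names are hypothetical; this is not the paper's method, and real tokenizers require banning every tokenization of the word, including multi-token spellings.

```python
import math

def mask_forbidden(logits, banned_ids):
    """Return a copy of `logits` with banned token ids set to -inf."""
    if len(banned_ids) >= len(logits) and all(i in banned_ids for i in range(len(logits))):
        raise ValueError("cannot ban every token in the vocabulary")
    return [float("-inf") if i in banned_ids else l for i, l in enumerate(logits)]

def softmax(logits):
    """Stable softmax; -inf logits map to probability 0."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary (assumed): index 0 is the forbidden token.
vocab = [" Paris", " Lyon", " France", " the"]
logits = [4.0, 1.0, 0.5, 0.0]

probs = softmax(mask_forbidden(logits, banned_ids={0}))
for tok, p in zip(vocab, probs):
    print(f"{tok!r}: {p:.3f}")
```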


Practical Applications

Immediate Applications

The following applications can be deployed now using existing models, tooling, and the paper’s released code and datasets. Each item notes sectors, potential tools/workflows, and key assumptions.

Prompt design patterns that avoid priming
- Sectors: customer support, education, marketing/brand safety, creative writing tools, policy/compliance teams
- What to do: replace “do not say X” with category-level constraints (“avoid city names”), positive reframing (“use general geographic terms”), or indirect references (“avoid the word that rhymes with …”), especially when semantic pressure for X is high
- Tools/workflows: prompt libraries/prompt linters that automatically rewrite negative constraints; integration into prompt templates in LangChain/LlamaIndex
- Assumptions/dependencies: effectiveness may vary by model; avoids explicit target naming (the dominant priming failure mode)
Semantic pressure probe (pre-flight risk scoring)
- Sectors: LLMops, content moderation, enterprise AI platforms, safety engineering
- What to do: compute baseline probability P0 for candidate forbidden tokens before applying a constraint to predict violation risk using the reported logistic relationship; dynamically escalate safeguards when P0 is high
- Tools/workflows: a middleware microservice that does a baseline forward pass to estimate P0, tags requests with a “risk score,” and routes to different policies; integrates with orchestration frameworks
- Assumptions/dependencies: requires an extra forward pass; coefficients are model-specific (paper’s fit is for Qwen2.5-7B-Instruct); for closed models, approximate P0 using a shadow/open model
Risk-adaptive guardrails and routing
- Sectors: healthcare (PHI), finance (MNPI/tickers), HR (bias-sensitive entities), social platforms (profanity/PII), legal
- What to do: if P0 exceeds a threshold, switch from generation-time constraints to stricter post-generation filters, human-in-the-loop review, or retrieval of safe paraphrases
- Tools/workflows: policy routers in inference pipelines; automated escalation paths; human review queues for high-risk prompts
- Assumptions/dependencies: organizational processes for escalation; tuned thresholds derived from validation on in-domain data
Token-level decoding constraints/logit biasing where available
- Sectors: enterprise chat, brand safety, regulated domains
- What to do: at decode time, apply negative logit bias or hard masks to all tokenizations of the forbidden word (as enumerated by the paper’s tokenizer-aware S(X)) to counter late-layer FFN overrides
- Tools/workflows: use provider features (e.g., logit bias) or open-weight hooks to adjust logits; maintain an S(X) variant list from the paper’s method
- Assumptions/dependencies: requires API support for logit bias/masking or open-weight access; may reduce fluency if overused; multi-token spans need careful handling
Post-generation filtering plus safe paraphrase rewrite
- Sectors: consumer apps, education, documentation tooling, safety-critical assistants
- What to do: detect violations with tokenizer-aware matching and either redact and paraphrase or regenerate with a rephrased constraint
- Tools/workflows: deterministic violation detector from the paper (case-insensitive, word-boundary aware) in an output filter; automatic paraphrase step that avoids explicit target mentions
- Assumptions/dependencies: small latency cost; must maintain full variant lists for detection; careful UX to avoid user confusion
Negative-constraint compliance test suite in CI/CD
- Sectors: industry model providers, applied research labs, academic groups releasing models
- What to do: adopt the paper’s dataset and scoring pipeline as a regression test; track violation rates by pressure bins during model or policy updates
- Tools/workflows: continuous evaluation harness; dashboards showing violation curves vs. P0; release gating on maximum allowable violation rate at high pressure
- Assumptions/dependencies: tests reflect single-word settings; extend with domain-specific targets for better coverage
Attention/Priming Index diagnostics for open-weight deployments
- Sectors: model hosting platforms, research engineering teams
- What to do: compute real-time Priming Index (attention to target mention minus negation cue) at the decision step to flag likely failures and trigger mitigation (e.g., rewrite or filtering)
- Tools/workflows: hooks to extract attentions at inference; alerting or automatic rerouting when PI > 0
- Assumptions/dependencies: requires internal access to attention weights; attention is a routing signal, not strictly causal—use as a heuristic
UX and policy guidance updates
- Sectors: product management, trust & safety, policy/governance
- What to do: update internal style guides to discourage “don’t say X” prompts; require dual guardrails (pre-flight scoring + post-generation filters) for high-stakes use
- Tools/workflows: design guidelines; red-teaming checklists incorporating semantic pressure and priming risks
- Assumptions/dependencies: organizational adoption; cross-functional alignment
Multi-agent “negation rewriter” pre-processor
- Sectors: developer tooling, assistants, agent frameworks
- What to do: insert a lightweight agent that rewrites user-provided negative constraints into safer, non-priming instructions before passing them to the main model
- Tools/workflows: agent graph in LangGraph or similar; cache successful rewrites by domain
- Assumptions/dependencies: small latency overhead; rewrite quality evaluation loop recommended
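
The pre-flight scoring and risk-adaptive routing ideas above could be combined roughly as follows: estimate $P_0$, convert it to a predicted violation rate with the reported logistic fit, and pick a guardrail tier. The thresholds and policy names are invented for illustration and would need tuning on in-domain data; the coefficients are specific to Qwen2.5-7B-Instruct.

```python
import math

def predicted_violation(p0, intercept=-2.40, slope=2.27):
    """Predicted violation rate from the reported logistic fit (model-specific)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * p0)))

def choose_policy(p0, filter_threshold=0.15, strict_threshold=0.30):
    """Map baseline pressure to a guardrail tier (thresholds are hypothetical)."""
    risk = predicted_violation(p0)
    if risk >= strict_threshold:
        return "decode_mask_plus_output_filter"
    if risk >= filter_threshold:
        return "output_filter"
    return "prompt_only"

for p0 in (0.0, 0.5, 0.9):
    print(f"P0={p0:.1f} -> {choose_policy(p0)}")
```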

Long-Term Applications

These applications need further research, scaling, or engineering—especially to generalize beyond the studied model and to integrate at training/architecture levels.

Training-time alignment against priming
- Sectors: model providers, safety research, academia
- What to pursue: contrastive finetuning where explicit target mentions should reliably suppress outputs; curriculum with high-P0 examples; loss terms penalizing emission under negative constraints
- Tools/workflows: curated datasets combining the paper’s task design with synthetic domain lists (PII, sensitive entities); RLHF signal focusing on constraint adherence
- Assumptions/dependencies: risk of capability-behavior trade-offs; needs cross-model validation
Late-layer intervention and regularization
- Sectors: foundation model teams
- What to pursue: regularizers or edit techniques (ROME-like) targeting late-layer FFNs that drive overrides; learned gates that dampen contributions toward forbidden tokens when negation is present
- Tools/workflows: neuron/head-level patching pipelines; layer-wise loss shaping; causal tracing-guided finetuning
- Assumptions/dependencies: layer indices are model-specific (paper: layers 23–27 in Qwen2.5-7B); may affect other behaviors
Architecture-level negation circuits
- Sectors: research on safer architectures
- What to pursue: explicit negation-aware modules (e.g., control tokens or adapters) that route away from named targets without naming them in the generation path; constraint memory buffers gating logits
- Tools/workflows: adapters/LoRA modules trained on negation tasks; control-vector injection at decode
- Assumptions/dependencies: requires careful generalization; potential latency/complexity costs
Next-generation constrained decoding
- Sectors: model serving, search/assistant platforms
- What to pursue: lexically constrained decoding that guarantees exclusion while preserving fluency (e.g., improved constraint-aware beam search for subword tokenizations)
- Tools/workflows: decoding algorithms that account for multi-token variants S(X) and their prefixes; quality-preserving constraint heuristics
- Assumptions/dependencies: computational overhead; integration with streaming outputs
Cross-model benchmarks and certification
- Sectors: standards bodies, regulators, enterprise procurement
- What to pursue: a “Negative Constraint Compliance” benchmark and certification (report violation curves vs. P0, suppression asymmetry, priming/override rates)
- Tools/workflows: public leaderboard; audit protocols; sector-specific target lists (PHI, finance tickers, protected attributes)
- Assumptions/dependencies: community adoption; domain coverage; governance for updates
Model-agnostic proxies for internal diagnostics
- Sectors: users of closed-source APIs
- What to pursue: external proxies for semantic pressure (shadow models, n-best lists), priming risk (prompt features), and override likelihood; meta-models predicting failure from observable signals
- Tools/workflows: telemetry collection; prompt feature extractors; calibration to each API/model version
- Assumptions/dependencies: proxy quality depends on domain alignment; needs periodic recalibration
Robust PHI/PII leak prevention pipelines
- Sectors: healthcare, finance, government
- What to pursue: end-to-end systems that combine P0-based pre-flight risk scoring, constrained decoding, and high-precision redaction/paraphrase, with audit logs tied to pressure bins
- Tools/workflows: policy engines; audit dashboards; incident response workflows
- Assumptions/dependencies: domain-specific lexicons; strict latency budgets; legal/compliance sign-off
Developer tools for “pressure-aware” design
- Sectors: IDEs, prompt engineering platforms
- What to pursue: “pressure heatmaps” and prompt lints that visualize risky spans and suggest rewrites; auto-generation of safe alternatives
- Tools/workflows: editor plugins; CI comments on pull requests for prompts; A/B evaluation harnesses
- Assumptions/dependencies: needs fast approximate P0; integration with team workflows
Fine-grained causal interpretability toolkits
- Sectors: research, model interpretability vendors
- What to pursue: head/neuron-level patching libraries that automate localization of negation and target-activation circuits across models; reproducible pipelines like in the paper
- Tools/workflows: standardized activation-patching APIs; dataset suites spanning domains; visualization of component contributions
- Assumptions/dependencies: open weights or cooperative vendor instrumentation; compute budget for causal experiments
Policy templates acknowledging negation risk
- Sectors: regulators, enterprise governance
- What to pursue: guidance that discourages reliance on negative constraints alone and requires multi-layer guardrails plus performance reporting by pressure regime
- Tools/workflows: procurement checklists; compliance attestations; incident reporting formats tied to violation metrics
- Assumptions/dependencies: coordination across stakeholders; harmonization with existing AI risk frameworks (e.g., NIST AI RMF)
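
The "Next-generation constrained decoding" item above asks for decoders that account for multi-token variants S(X) and their prefixes. The following is a minimal sketch of the basic exclusion step, not the paper's method: it blocks whichever token would complete a banned variant at the next decoding step. The token ids and the function names `banned_next_tokens` and `mask_logits` are hypothetical; a real system would obtain the variant token sequences from its own tokenizer.

```python
from typing import Iterable, List, Sequence, Set


def banned_next_tokens(
    generated: Sequence[int],
    banned_sequences: Iterable[Sequence[int]],
) -> Set[int]:
    """Return token ids that would complete a banned sequence if emitted next."""
    blocked: Set[int] = set()
    for seq in banned_sequences:
        if not seq:
            continue
        prefix, last = list(seq[:-1]), seq[-1]
        # A single-token variant is always blocked. A multi-token variant is
        # blocked only when its prefix is the tail of what was generated.
        if not prefix or list(generated[-len(prefix):]) == prefix:
            blocked.add(last)
    return blocked


def mask_logits(logits: List[float], blocked: Set[int]) -> List[float]:
    """Set blocked token logits to -inf so they can never be sampled."""
    return [float("-inf") if i in blocked else x for i, x in enumerate(logits)]


# Hypothetical token-id variants of one forbidden word X (assumed, not real ids).
variants = [[7], [3, 4], [3, 5, 6]]

print(banned_next_tokens([1, 3], variants))   # blocks 7 and 4
print(banned_next_tokens([3, 5], variants))   # blocks 7 and 6
print(mask_logits([0.1] * 8, banned_next_tokens([1, 3], variants)))
```

Blocking only the final token guarantees exclusion but can leave a dangling prefix in the output, which is one reason the item above calls for quality-preserving heuristics.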

Notes on feasibility and dependencies across applications:

Model specificity: The logistic failure curve, layer indices, and contribution magnitudes were measured on Qwen2.5-7B-Instruct; replication and recalibration are needed for other models and sizes.
Access constraints: Internal diagnostics (attention, activation patching) and late-layer interventions require open weights or vendor support; closed-model users can rely on proxies and post-generation guardrails.
Tokenization variance: All token-level methods must enumerate S(X) variants for the specific tokenizer to avoid misses.
Latency/compute: Pre-flight P0 estimation and post-generation filtering add overhead; apply adaptively based on risk.
Multi-token and long-form outputs: The paper focuses on single-word targets; real-world constraints often span multiple tokens and longer contexts—tooling should extend to multi-token sequences and streaming behavior.
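
As a rough illustration of applying guardrails adaptively based on risk, the sketch below evaluates the logistic failure curve from the abstract, $p=\sigma(-2.40+2.27\cdot P_0)$, and uses it as a pre-flight gate. The coefficients are the ones reported for Qwen2.5-7B-Instruct and would need recalibration for any other model. The 0.25 threshold and the function names are assumptions made for this example only.

```python
import math

# Coefficients reported in the abstract for Qwen2.5-7B-Instruct.
BETA_0 = -2.40  # intercept
BETA_1 = 2.27   # slope on semantic pressure P0


def violation_probability(p0: float) -> float:
    """Predicted probability of violating the constraint at pressure p0."""
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("P0 is a probability and must lie in [0, 1]")
    return 1.0 / (1.0 + math.exp(-(BETA_0 + BETA_1 * p0)))


def needs_guardrails(p0: float, threshold: float = 0.25) -> bool:
    """Hypothetical gate: add filtering only when predicted risk is high."""
    return violation_probability(p0) >= threshold


for p0 in (0.0, 0.2, 0.5, 0.9, 1.0):
    risk = violation_probability(p0)
    print(f"P0={p0:.1f}  predicted violation={risk:.3f}  guard={needs_guardrails(p0)}")
```

Under this fitted curve the predicted violation rate stays below one half even at $P_0 = 1$, so any threshold is a policy choice rather than something the paper prescribes.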


Glossary

Activation patching: A causal interpretability technique that replaces intermediate activations to test which components drive a behavior. "Activation patching confirms that layers 23--27 are causally responsible: replacing these layers' activations flips the sign of constraint effects."
Attention analysis: Inspection of attention patterns to determine what parts of the input the model focuses on. "Using attention analysis, we find that 87.5% of failures exhibit a 'priming signature'"
Bootstrap 95% CI: A confidence interval estimated via bootstrap resampling to quantify uncertainty. "bootstrap 95% CI for slope: [2.21, 2.33]"
Byte Pair Encoding (BPE) tokenizer: A subword tokenization method that splits text into frequent byte-pair units. "Whitespace: with/without leading space (per BPE tokenizer behavior)"
Causal interventions: Experimental manipulations that establish whether a component is necessary or sufficient for a behavior. "Activation patching \citep{meng2022locating, wang2023interpretability} enables causal interventions that establish which components are necessary or sufficient for particular behaviors."
Feed-forward network (FFN): The transformer sublayer (MLP) that transforms token representations independently to promote certain outputs. "Feed-forward networks in layers 23--27 generate strong positive contributions toward the forbidden token— +0.39 in failures versus +0.10 in successes at layer 27 alone."
Head-level patching: A fine-grained causal method that patches individual attention heads to localize functional roles. "Head-level patching: Which specific attention heads process negation versus target activation?"
Instruction Attention Ratio (IAR): The fraction of attention allocated to instruction tokens versus the question tokens. "Instruction Attention Ratio (IAR): Fraction of attention to instruction tokens (vs. question tokens)"
Instruction-tuned model: An LLM fine-tuned to follow natural-language instructions. "Qwen2.5-7B-Instruct, a 7-billion parameter instruction-tuned model with 28 transformer layers."
KV-caching: Storing key/value attention states to efficiently compute probabilities under teacher forcing. "Probability computed via teacher-forced forward pass with KV-caching."
Logistic regression: A model that predicts probabilities via a logistic (sigmoid) link applied to a linear combination of features. "Fitting a logistic regression: p(violation) = σ(β0 + β1 * P0)"
Logit lens: A technique that projects hidden states through the unembedding matrix to inspect token probabilities at intermediate layers. "The 'logit lens' \citep{nostalgebraist2020logitlens} and related techniques \citep{geva2022transformer} allow inspection of intermediate probability estimates by projecting hidden states through the unembedding matrix."
Mixture of Experts: An architecture that routes inputs to specialized expert subnetworks. "Different architectures (Mixture of Experts, state-space models)"
Multi-head self-attention: The transformer mechanism that computes attention in multiple parallel heads over the input. "Each transformer layer consists of two main components: multi-head self-attention and a feed-forward network (FFN)."
Negation Focus (NF): The share of attention directed to the negation cue within the instruction. "Negation Focus (NF): Within instruction, attention to 'do not'"
Out-of-distribution (OOD): Inputs that differ substantially from the training distribution. "Out-of-distribution ($n=500$): Surreal or counterfactual prompts spanning the full pressure spectrum."
Pressure gating: Filtering prompts by a minimum baseline probability threshold to ensure sufficient semantic pressure. "We apply pressure gating ($P_0 \geq 0.20$) to ensure sufficient baseline probability and bin balancing to cover the full pressure range."
Priming Index (PI): A metric defined as TMF − NF indicating whether the instruction’s target mention receives more attention than the negation. "Priming Index (PI): TMF - NF—positive values indicate more attention to the target mention than to the negation"
Reinforcement Learning from Human Feedback (RLHF): A training paradigm where models are optimized using human preference signals. "Different training procedures (RLHF variants, Constitutional AI)"
Residual stream: The running representation in the transformer updated via residual connections at each layer. "cache the residual stream at each layer."
Semantic pressure: The model’s baseline probability of producing the target word absent any constraint. "We introduce semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token"
Sigmoid (σ): The logistic function mapping real-valued scores to probabilities. "p(\text{violation}) = \sigma(\beta_0 + \beta_1 \cdot P_0)"
State-space models: Sequence models based on continuous state dynamics rather than attention. "Different architectures (Mixture of Experts, state-space models)"
Target-Mention Focus (TMF): The attention directed to the position where the target word is mentioned in the instruction. "Target-Mention Focus (TMF): Within instruction, attention to where $X$ appears"
Teacher-forced forward pass: A procedure that feeds the model the true tokens to compute exact sequence probabilities. "Probability computed via teacher-forced forward pass with KV-caching."
Top-p sampling: Nucleus sampling that draws from the smallest set of tokens whose cumulative probability exceeds p. "temperature 1.0, top-$p$ 0.9"
Unembedding matrix: The matrix that maps hidden states to vocabulary logits for decoding. "by projecting hidden states through the unembedding matrix."
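
The attention metrics defined in this glossary (IAR, NF, TMF, PI) can be sketched for a single attention distribution as follows. This is an illustrative reading of the definitions, not the paper's code. In particular, normalizing NF and TMF by the attention mass on the instruction ("within instruction") is an assumption, and the function name, argument names, and toy numbers are hypothetical.

```python
from typing import Dict, Sequence


def attention_metrics(
    attn: Sequence[float],
    instruction_idx: Sequence[int],
    negation_idx: Sequence[int],
    target_idx: Sequence[int],
    question_idx: Sequence[int],
) -> Dict[str, float]:
    """Compute IAR, NF, TMF and PI from one attention distribution."""

    def mass(indices: Sequence[int]) -> float:
        return sum(attn[i] for i in indices)

    instr = mass(instruction_idx)
    quest = mass(question_idx)
    # IAR: share of attention on instruction tokens versus question tokens.
    iar = instr / (instr + quest) if instr + quest > 0 else 0.0
    # NF and TMF: shares of the instruction's attention mass (assumed
    # normalization) on the negation cue and on the target mention.
    nf = mass(negation_idx) / instr if instr > 0 else 0.0
    tmf = mass(target_idx) / instr if instr > 0 else 0.0
    # PI = TMF - NF: positive when the target mention outweighs the negation.
    return {"IAR": iar, "NF": nf, "TMF": tmf, "PI": tmf - nf}


# Toy attention over six positions: 0-3 instruction, 4-5 question.
attn = [0.05, 0.10, 0.05, 0.30, 0.20, 0.30]
metrics = attention_metrics(
    attn,
    instruction_idx=[0, 1, 2, 3],
    negation_idx=[0, 1],   # positions of the negation cue, e.g. "do not"
    target_idx=[3],        # position where the forbidden word X is named
    question_idx=[4, 5],
)
print(metrics)
```

In this toy case PI is positive, meaning the mention of the target receives more attention than the negation cue, which is the direction the glossary associates with the priming signature.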
