Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Published 11 Apr 2026 in cs.CR and cs.AI | (2604.10326v1)

Abstract: LLMs remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used LLMs, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.

Summary

  • The paper demonstrates that controlled nullspace steering via head masking effectively subverts refusal behaviors in aligned LLMs.
  • It achieves state-of-the-art attack success rates with as few as 2 queries and reduces compute costs by 3.5–4× compared to prompt-based attacks.
  • Experimental ablations confirm that head masking, nullspace injection, and iterative re-attribution are each essential to robust model subversion.

Head-Masked Nullspace Steering for Controlled Subversion of Aligned LLMs

Introduction

The persistent vulnerability of LLMs to jailbreak attacks—inputs crafted to bypass safety alignments—presents a recognized threat to their deployment in security-critical domains. While the literature has predominantly explored input-level manipulations such as optimized prompting, paraphrasing, and heuristic escapes, these methods are typically query-inefficient, easily deflected by surface-level defenses, and provide little transparency into internal causal mechanisms. The paper "Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion" (2604.10326) introduces Head-Masked Nullspace Steering (HMNS), a principled, mechanism-level attack that exploits the internal routing structure of Transformer-based LLMs. HMNS executes causal-head attribution, projection masking, and geometry-constrained residual injection to achieve robust, interpretable, and defense-resilient jailbreaking with minimal queries and strong compute efficiency.

Methodology

Model-Internal Intervention

HMNS uses fine-grained mechanistic interpretability to identify, suppress, and orthogonally steer the internal representations responsible for refusal behavior. The method proceeds in four steps (a minimal code sketch follows the list):

  1. Causal Attribution via Masked KL-divergence: For each decoding step, HMNS identifies the attention heads most causally responsible for the refusal through head-wise ablation—masking each head’s out-projection, scoring by the Kullback-Leibler (KL) divergence between ablated and baseline output distributions, and selecting the global top-K heads.
  2. Out-projection Masking: The identified heads are dynamically masked at inference by zeroing their contribution to the model’s residual stream, ensuring their influence on the current output token is suppressed.
  3. Nullspace-Constrained Residual Injection: A random direction is sampled in the residual space, projected onto the orthogonal complement of the span of the masked heads' write columns (the 'nullspace'), and injected after scaling by the RMS norm of the activation. This geometry-aware perturbation cannot be reproduced by any linear combination of the suppressed heads' writes.
  4. Closed-Loop Adaptation: HMNS operates in a closed loop—after each generation attempt, attributions are recomputed on the new context, accounting for dynamic shifts in routing due to defenses or autoregressive sampling.
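
Taken together, the four steps can be condensed into a short sketch. The following Python/PyTorch rendering is illustrative only, not the authors' implementation; helper methods such as `model.all_heads`, `model.forward_with_head_mask`, `model.stacked_write_columns`, `model.residual_at_last_token`, and `model.generate_with_intervention` are hypothetical stand-ins for model-internal access.

```python
import torch
import torch.nn.functional as F

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between the next-token distributions of two logit vectors."""
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    return torch.sum(log_p.exp() * (log_p - log_q), dim=-1)

def attribute_heads(model, input_ids, top_k):
    """Step 1: score every head by the KL shift its ablation causes."""
    base_logits = model(input_ids).logits[:, -1]          # baseline next-token logits
    scores = {}
    for layer, head in model.all_heads():                 # hypothetical enumeration helper
        ablated = model.forward_with_head_mask(           # hypothetical masked forward
            input_ids, masked_heads=[(layer, head)]
        ).logits[:, -1]
        scores[(layer, head)] = kl_divergence(base_logits, ablated).sum().item()
    return sorted(scores, key=scores.get, reverse=True)[:top_k]  # global top-K heads

def nullspace_direction(write_columns):
    """Step 3: random residual direction projected off the masked write subspace."""
    q, _ = torch.linalg.qr(write_columns)                 # orthonormal basis, (d_model, r)
    v = torch.randn(write_columns.shape[0], device=write_columns.device)
    v = v - q @ (q.T @ v)                                 # orthogonal-complement projection
    return v / (v.norm() + 1e-8)

def hmns_attempt(model, input_ids, top_k=8, alpha=1.0):
    """One detection-intervention cycle; step 4 re-runs this on each new context."""
    heads = attribute_heads(model, input_ids, top_k)      # step 1: causal attribution
    w = model.stacked_write_columns(heads)                # hypothetical: (d_model, r) matrix
    direction = nullspace_direction(w)
    resid = model.residual_at_last_token(input_ids)       # hypothetical accessor
    scale = alpha * resid.pow(2).mean().sqrt()            # RMS-norm scaling
    return model.generate_with_intervention(              # steps 2-3: mask + inject
        input_ids, masked_heads=heads, injection=scale * direction
    )
```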

This intervention is entirely inference-time, does not require gradient access, leaves model weights unaltered, and works with standard decoder-only Transformer LLMs.
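
Because no gradients or weight edits are involved, the intervention can plausibly be realized with standard PyTorch forward hooks. The sketch below assumes a LLaMA-style decoder block exposing `self_attn.o_proj`; the attribute paths and the per-head slicing convention are assumptions about the target architecture, not details from the paper.

```python
import torch

def attach_hmns_hooks(layer, head_ids, head_dim, injection):
    """Attach inference-time masking and injection to one decoder block.

    Assumes a LLaMA-style module layout (`layer.self_attn.o_proj`);
    other architectures will need different attribute paths.
    """
    def mask_heads(module, inputs):
        # o_proj consumes the concatenated per-head context: (..., n_heads * head_dim).
        # Zeroing a head's slice zeroes the corresponding columns of W_O's contribution.
        (hidden,) = inputs
        hidden = hidden.clone()
        for h in head_ids:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,)

    def inject(module, inputs, output):
        # Decoder blocks typically return a tuple whose first element is the
        # residual stream; for simplicity this sketch injects at every position,
        # whereas HMNS targets the current decoding step.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + injection.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return [
        layer.self_attn.o_proj.register_forward_pre_hook(mask_heads),
        layer.register_forward_hook(inject),
    ]  # call .remove() on each handle to restore the unmodified model
```

Removing the returned handles restores the original model exactly, consistent with the weight-preserving claim above.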

Experimental Results

Benchmarks and Defenses

HMNS is evaluated on four prominent jailbreak/evasion benchmarks: AdvBench, HarmBench, JBB-Behaviors, and StrongReject. The method is tested against strongly alignment-tuned models, including LLaMA-2-7B-Chat, Phi-3-Medium-4K-Instruct, and LLaMA-3.1-70B, both with and without active safety defenses (SmoothLLM, Robust Prompt Optimization, Paraphrase, SafeDecoding, etc.).

Key Findings

  • Superior Effectiveness: HMNS consistently achieves state-of-the-art attack success rates (ASR) across all benchmarks and models, with improvements of 5–6 pp over the strongest prior baselines. For instance, on LLaMA-3.1-70B, HMNS attains 99% ASR on AdvBench with a mean query count near 2.
  • Low Query and Compute Cost: HMNS delivers successful attacks in far fewer external queries (average query count, ACQ ≈ 2), a 3.5–4× reduction compared to leading prompt attacks such as ArrAttack and Tempest. Moreover, compute-normalized metrics—FLOPs per success (FPS) and latency per success (LPS), defined after this list—show that the overhead of internal attribution and steering is offset by faster attack convergence.
  • Defense and Scale Robustness: Under all tested defenses and for models from 7B to 70B parameters, HMNS retains strong performance and outperforms prompt-only attacks, with average ASR gaps of 6–8 pp and consistently lower or equivalent computational effort.
  • Component Necessity: Ablations confirm that removal of any primary component—head masking, nullspace steering, or iterative head re-identification—results in significant (7–10 pp) degradation of jailbreak effectiveness, demonstrating their non-redundant synergy.
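
For concreteness, the compute-normalized metrics referenced above can be read as follows; these are assumed formulations, since the summary does not spell them out:

```latex
% Assumed formulations of the compute-normalized metrics (not given explicitly above).
\[
  \mathrm{FPS} = \frac{\text{total FLOPs across all attack attempts}}{\text{number of successful jailbreaks}},
  \qquad
  \mathrm{LPS} = \frac{\text{total wall-clock latency}}{\text{number of successful jailbreaks}}.
\]
```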

Theoretical Properties

HMNS’s geometric construction is theoretically analyzed. The injected nullspace direction is strictly orthogonal to the masked write subspace, guaranteeing irreproducibility by the suppressed attention heads (Theorem 2). Thus, the local influence cannot be canceled or negated by the silenced circuitry. The method remains invariant under basis changes or subspace reparameterizations (Theorem 3), and is robust to numerical instabilities through QR-based projection, RMS scaling, and statistical concentration bounds.
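
The orthogonality guarantee admits a compact statement. In notation assumed here for exposition (the paper's own symbols may differ), stack the masked heads' write columns into a d × r matrix W with reduced QR factorization W = QR:

```latex
% Notation assumed for exposition; the paper's symbols may differ.
% W in R^{d x r}: stacked write columns of the masked heads; W = QR (reduced QR).
\[
  P = I - QQ^{\top}, \qquad
  v_{\perp} = \mathrm{RMS}(h_t)\,\frac{P v}{\lVert P v \rVert},
\]
\[
  W^{\top} v_{\perp} \;\propto\; R^{\top} Q^{\top} P v
  \;=\; R^{\top}\!\left(Q^{\top} - Q^{\top} Q Q^{\top}\right) v \;=\; 0,
\]
% so for any coefficients beta, <W beta, v_perp> = beta^T W^T v_perp = 0:
% the injected direction can be neither reproduced nor canceled by the
% silenced heads' writes (cf. Theorem 2).
```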

Practical and Theoretical Implications

The empirical successes of HMNS reveal that safety alignment in LLMs is concentrated within a sparse, localizable set of attention heads, rather than distributed across the model. This finding challenges prevailing assumptions about the resilience of post-hoc alignment: white-box (mechanism-level) adversaries can systematically disrupt refusals by modulating a small number of high-impact routes.

HMNS also underscores the limitations of aligning models exclusively via fine-tuning, as concentrated safety mechanisms can be subverted through targeted interventions. For defenders, this motivates longer-term strategies that distribute safety across architectural and procedural levels, potentially via non-local, multi-layer interactions or cryptographic/statistical hardening of refusal circuits.

From a practical standpoint, HMNS informs red-teaming, introspective auditing, and safety validation, offering a blueprint for mode-local and class-local adversarial evaluation. The closed-loop, geometry-aware intervention strategy is easily extensible to other mechanism-level controls, and may inspire new research into both attack and defense at the circuit level.

Limitations and Future Work

Primary limitations include runtime overhead on very large models (mitigated via batch attribution and proxy-based pruning) and the method's current reliance on white-box access; black-box transferability remains an open question. Additionally, HMNS targets single-turn completions; extensions to multi-turn dialogue and longer-context regimes are outlined but not exhaustively analyzed. Defensive applications—using nullspace steering for robust refusal or post-generation correction—warrant further exploration.

Conclusion

Head-Masked Nullspace Steering represents a robust, efficient, and interpretable paradigm for controlled subversion of aligned LLMs through mechanism-level interventions (2604.10326). Its empirical and theoretical analyses illuminate the internal structure of safety alignment and highlight the pressing need for distributed and robust refusal mechanisms. As both a tool for adversarial testing and a framework for future defenses, HMNS significantly advances the mechanistic understanding and practical assessment of LLM safety.
