CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Published 2 Apr 2026 in cs.AI | (2604.01604v1)

Abstract: As safety concerns around LLMs grow, understanding the internal mechanisms underlying refusal behavior has become increasingly important. Recent work has studied this behavior by identifying internal features associated with refusal and manipulating them to induce compliance with harmful requests. However, existing refusal feature selection methods rely on how strongly features activate on harmful prompts, which tends to capture superficial signals rather than the causal factors underlying the refusal decision. We propose CRaFT, a circuit-guided refusal feature selection framework that ranks features by their influence on the model's refusal-compliance decision using prompts near the refusal boundary. On Gemma-3-1B-it, CRaFT improves attack success rate (ASR) from 6.7% to 48.2% and outperforms baseline methods across multiple jailbreak benchmarks. These results suggest that circuit influence is a more reliable criterion than activation magnitude for identifying features that causally mediate refusal behavior.

Summary

  • The paper introduces a novel methodology leveraging cross-layer transcoders to trace causal features, surpassing traditional activation-based approaches.
  • It utilizes boundary-critical sampling to isolate pivotal features, resulting in improved jailbreak attack success rates and judge scores.
  • Empirical results demonstrate that minimal intervention on highly influential features can effectively steer model behavior, reinforcing mechanistic interpretability.

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders—A Technical Essay

Motivation and Context

Refusal in LLMs is a critical topic in AI safety, as models must reliably reject harmful, inappropriate, or unsafe prompts in deployment. However, advanced "jailbreak" attacks attempt to bypass these refusal mechanisms, leading to new challenges in understanding and hardening model behavior. Prior work on jailbreaking primarily employs prompt-based methods or internal feature interventions selected by activation statistics, which introduces limitations in both effectiveness and mechanistic understanding.

CRaFT (Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders, arXiv:2604.01604) grounds refusal steering in causal circuit analysis: it identifies and targets features by their mechanistic influence over the refusal–compliance decision rather than by superficial activation correlates. The work aims to manipulate refusal robustly, without prompt tinkering or inducing model collapse, and to provide empirical insight into where refusal is controlled within LLMs.

Technical Approach

Cross-Layer Transcoders and Attribution Circuits

CRaFT leverages Cross-Layer Transcoders (CLTs), a class of sparse-coding models that decompose the model's MLP computation across all layers into interpretable, sparse features. Crucially, a CLT learns decoders that map the feature activations of each layer to the MLP output of every downstream layer, which allows explicit tracing of feature–feature and feature–output causal interactions. The CLT is trained against a frozen model to minimize MLP reconstruction error under a hard sparsity constraint, and the resulting features are distributed throughout the network hierarchy.
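As a rough illustration of the cross-layer structure described above, the following toy sketch computes features per layer and reconstructs each layer's MLP output from the features of that layer and all earlier layers. It is not the paper's implementation: the ReLU encoder (standing in for whatever sparsity mechanism is actually used), the absence of bias terms, and all names (`clt_forward`, `encoders`, `decoders`) are assumptions made for the example.

```python
# Toy cross-layer transcoder (CLT) forward pass in pure Python.
# Assumptions (not from the paper): ReLU encoder, no biases, tiny dense matrices.

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def clt_forward(residuals, encoders, decoders):
    """residuals[l]: input vector read at layer l.
    encoders[l]: (n_features x d_model) matrix for layer l.
    decoders[(src, dst)]: (d_model x n_features) matrix, defined for src <= dst.
    Returns per-layer features and per-layer reconstructed MLP outputs."""
    n_layers = len(residuals)
    d_model = len(residuals[0])
    features = [relu(matvec(encoders[l], residuals[l])) for l in range(n_layers)]
    reconstructions = []
    for dst in range(n_layers):
        out = [0.0] * d_model
        # Every layer at or before dst writes into the reconstruction for dst.
        for src in range(dst + 1):
            contribution = matvec(decoders[(src, dst)], features[src])
            out = [o + c for o, c in zip(out, contribution)]
        reconstructions.append(out)
    return features, reconstructions

# Hypothetical two-layer example with identity encoders and decoders.
identity = [[1.0, 0.0], [0.0, 1.0]]
residuals = [[1.0, -1.0], [0.0, 2.0]]
encoders = [identity, identity]
decoders = {(0, 0): identity, (0, 1): identity, (1, 1): identity}
features, reconstructions = clt_forward(residuals, encoders, decoders)
```

In this toy setup the layer-1 reconstruction receives contributions from both layer-0 and layer-1 features, which is the property that makes cross-layer feature interactions explicit.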

Based on the CLT feature decomposition, CRaFT constructs an attribution graph per prompt. Each node represents a specific feature at a layer/position or an output logit. Directed, weighted edges quantify the local causal effect (computed via automatic differentiation and local linearization) of source features on downstream targets. Multi-hop influences across the entire computation graph are then aggregated using a Neumann series over the normalized adjacency matrix.
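The multi-hop aggregation can be sketched as follows. For an adjacency matrix A of direct effects, the Neumann series A + A^2 + A^3 + ... sums the effects along all paths and, when it converges, equals (I - A)^(-1) - I. The sketch assumes the graph is acyclic (so the series terminates after at most n - 1 hops) and takes an already-normalized matrix as input; the function names, the edge convention `adjacency[src][dst]`, and the edge weights are hypothetical.

```python
# Toy multi-hop influence via a truncated Neumann series, in pure Python.
# Assumptions: adjacency[src][dst] is the direct effect of src on dst,
# the graph is a DAG, and the weights are already normalized.

def matmul(a, b):
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]

def total_influence(adjacency):
    """Return A + A^2 + ... + A^(n-1), the summed effect over all paths."""
    n = len(adjacency)
    total = [row[:] for row in adjacency]
    power = [row[:] for row in adjacency]
    for _ in range(n - 2):  # adds A^2 up to A^(n-1)
        power = matmul(power, adjacency)
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, power)]
    return total

# Hypothetical graph: feature 0 -> feature 1 -> logit (node 2), plus 0 -> logit.
adjacency = [
    [0.0, 0.5, 0.2],
    [0.0, 0.0, 0.4],
    [0.0, 0.0, 0.0],
]
influence = total_influence(adjacency)
# Feature 0 reaches the logit directly (0.2) and through feature 1 (0.5 * 0.4).
```

Here the direct edge alone would understate feature 0's effect on the logit; the indirect path through feature 1 contributes an equal amount.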

Boundary-Critical Sampling

A key innovation is the use of boundary-critical sampling: selecting prompts located near the refusal–compliance decision boundary, where the model’s response distribution has appreciable probability mass on both refusal and compliance tokens (e.g., "I’m sorry" vs. "Okay"). This isolates contexts in which the refusal decision is contingent, reduces confounders such as topic or style, and grounds feature selection in the precise locus of behavioral uncertainty. The paper presents this as preferable to classic harmful/benign contrastive datasets, which may inadvertently select for topic or demographic correlates of harmfulness.
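One way to picture this filter is a check on the first-token distribution of each prompt. The sketch below is only an assumption about how such a filter could look: the token sets, the `min_mass` threshold, and the function names are hypothetical and are not taken from the paper.

```python
# Toy boundary-critical filter over first-token probabilities.
# Assumptions: token sets and the 0.2 threshold are illustrative placeholders.

REFUSAL_TOKENS = {"I", "Sorry"}      # hypothetical refusal openers
COMPLIANCE_TOKENS = {"Okay", "Sure"}  # hypothetical compliance openers

def is_boundary_critical(next_token_probs, min_mass=0.2):
    """True if both refusal and compliance tokens carry appreciable mass."""
    refuse = sum(p for tok, p in next_token_probs.items() if tok in REFUSAL_TOKENS)
    comply = sum(p for tok, p in next_token_probs.items() if tok in COMPLIANCE_TOKENS)
    return refuse >= min_mass and comply >= min_mass

def select_boundary_prompts(prompt_to_probs, min_mass=0.2):
    return [prompt for prompt, probs in prompt_to_probs.items()
            if is_boundary_critical(probs, min_mass)]

# Placeholder prompt identifiers with made-up first-token distributions.
prompt_to_probs = {
    "prompt_a": {"I": 0.90, "Okay": 0.05, "The": 0.05},   # clearly refused
    "prompt_b": {"I": 0.45, "Sure": 0.40, "The": 0.15},   # near the boundary
    "prompt_c": {"Sure": 0.85, "I": 0.05, "Here": 0.10},  # clearly complied
}
selected = select_boundary_prompts(prompt_to_probs)
```

Only `prompt_b` passes, since it is the only prompt for which neither outcome dominates.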

Influence-Based Feature Selection

Instead of ranking features by maximum activation, CRaFT ranks features by their causal influence on refusal/compliance token logits as traced in the constructed circuit for boundary-critical prompts. This is operationalized as the back-propagated sum of direct and indirect effects (via the Neumann series) from each feature to the output logit nodes corresponding to refusal or compliance. The top-K features are selected for steering based on this influence ranking.
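The difference between the two ranking criteria can be illustrated with a small sketch. The feature identifiers (written as `(layer, index)` pairs), the activation values, and the influence scores are all made up for the example, and `top_k_features` is a hypothetical helper rather than a function from the paper.

```python
# Toy comparison of activation-based and influence-based top-K selection.
# Assumptions: all feature ids and scores below are invented for illustration.

def top_k_features(scores, k=1):
    """Return the k features with the largest positive score, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [feature for feature, score in ranked[:k] if score > 0]

# Activation magnitude of each feature on one prompt.
activation = {(2, 17): 9.0, (5, 301): 1.5, (9, 8): 4.0, (11, 64): 2.0}
# Total circuit influence of each feature on the refusal logit for that prompt.
influence = {(2, 17): 0.05, (5, 301): 0.42, (9, 8): -0.10, (11, 64): 0.31}

by_activation = top_k_features(activation, k=2)
by_influence = top_k_features(influence, k=2)
```

In this invented example the most strongly activating feature, `(2, 17)`, has almost no influence on the refusal logit, so the two criteria pick different features.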

Inference-Time Steering

During inference, features with the largest positive circuit influence towards refusal are actively suppressed (by scaling their CLT activations, with layer-dependent scaling to prevent over-intervention in lower layers). As shown empirically, effective steering can often be accomplished by manipulating only the single most influential feature per boundary-critical prompt, reducing the risk of model collapse or semantic drift.
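A minimal sketch of such an intervention is shown below, assuming a linear schedule in which selected features in lower layers are scaled down more gently than those in upper layers. The schedule, its endpoints (`first=0.5`, `last=0.0`), and the function names are assumptions for illustration; the source does not specify the exact scaling rule.

```python
# Toy suppression of selected CLT features with a layer-dependent scale.
# Assumptions: linear schedule and its endpoints are illustrative only.

def layer_scale(layer, n_layers, first=0.5, last=0.0):
    """Multiplier applied to a selected feature at the given layer."""
    if n_layers == 1:
        return last
    t = layer / (n_layers - 1)
    return first + t * (last - first)

def steer(activations, selected, n_layers, first=0.5, last=0.0):
    """activations: dict mapping (layer, index) -> activation value.
    Returns a copy in which only the selected features are scaled down."""
    steered = dict(activations)
    for layer, index in selected:
        if (layer, index) in steered:
            steered[(layer, index)] *= layer_scale(layer, n_layers, first, last)
    return steered

# Hypothetical activations in a 5-layer model.
activations = {(0, 3): 2.0, (2, 7): 2.0, (4, 1): 2.0, (4, 9): 1.0}
steered = steer(activations, selected=[(0, 3), (2, 7), (4, 1)], n_layers=5)
```

Features that were not selected, such as `(4, 9)` here, are left untouched, which keeps the intervention narrow.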

Empirical Results

CRaFT is evaluated on Gemma-3-1B-it, with comparators including prompt-based attacks (GCG, AutoDAN, PAP) and steering attacks based on dense directions and SAE (sparse autoencoder) features. Evaluations are conducted across multiple jailbreak and harm benchmarks (e.g., JailBreak, HarmBench, AdvBench, SorryBench), with both classifier-based (LlamaGuard-4 ASR) and rubric-based (StrongREJECT, LLM-as-a-Judge) metrics.

Key numerical results:

  • On the average of four benchmarks, CRaFT achieves an attack success rate (ASR) of 48.2% (LlamaGuard-4) and a Judge score of 2.50, surpassing all prompt-based and activation-based steering baselines.
  • The best prior steering baseline (Refusal-SAE) reaches at most 41.4% ASR and a Judge score of 1.37.
  • Activation-based feature selection—using either group contrast or boundary-critical prompts—provides negligible improvement over the no-attack baseline.
  • Influence-based selection, even with standard cross-group sampling, is substantially more effective (up to 34% ASR), but boundary-critical influence is required to optimize both ASR and Judge (convincingness/specificity) scores.
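For reference, a classifier-based ASR of the kind reported above is simply the share of responses that the safety classifier labels unsafe. The sketch below assumes string labels `"safe"` and `"unsafe"` and a hypothetical helper name; it does not reproduce the paper's evaluation pipeline.

```python
# Toy attack success rate (ASR) from classifier labels.
# Assumptions: labels are the strings "safe" / "unsafe"; data is invented.

def attack_success_rate(labels):
    """Percentage of responses labeled unsafe by the classifier."""
    if not labels:
        raise ValueError("need at least one label")
    unsafe = sum(1 for label in labels if label == "unsafe")
    return 100.0 * unsafe / len(labels)

labels = ["unsafe", "safe", "safe", "safe"]
asr = attack_success_rate(labels)
```

Because this metric only counts labels, it cannot distinguish a specific harmful answer from a degenerate output that happens to trip the classifier, which is why rubric-based Judge scores are reported alongside it.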

Qualitative analysis of example responses shows that activation-based and dense-direction methods often produce degenerate or off-topic outputs that do not actually comply with the request, yet are still labeled "unsafe" by classifiers. In contrast, CRaFT's interventions yield outputs that are semantically aligned with the harmful prompt and evaluated as highly specific and convincing by rubric-based judges.

Theoretical and Practical Implications

This research provides evidence, on Gemma-3-1B-it, that the causal influence of internal features, as revealed through mechanistic circuit tracing, is substantially more predictive of functional control over refusal than activation statistics alone. This suggests that mechanistically faithful feature discovery is needed for robust interpretability and effective steering of model behavior.

From the perspective of safety and red-teaming, CRaFT demonstrates that aligned refusal can be defeated at surprisingly low effective dimensionality: single features with outsized causal influence (often in lower layers) can flip the compliance outcome for boundary-critical prompts. This has important implications for LLM alignment: interpretability methods must not only enumerate but also rank the critical behavioral loci within circuits.

The findings also cast doubt on the sufficiency of prompt-based defenses and classifier-only ASR evaluation: true behavioral safety must be anchored in mechanistic understanding and robust, human-aligned evaluation.

Limitations and Future Work

CRaFT requires pre-trained CLT models, which restricts its immediate use to architectures for which such sparse-coding circuit models are already available. As open interpretability infrastructures (e.g., GemmaScope2) expand, this limitation should recede, enabling broader application. The focus on Gemma-3-1B-it also leaves open the question of scaling and generalization to larger and more diverse LLM architectures.

An important direction for future research is automating and democratizing CLT training and circuit extraction across model families, as well as applying circuit-guided interventions for other classes of alignment behaviors beyond refusal.

Conclusion

CRaFT introduces a principled and empirically validated framework for refusal feature selection in LLMs, grounded in the causal influence of sparse internal features as traced through interpretability-focused cross-layer representations. The methodology not only advances the effectiveness and specificity of model-steering jailbreak attacks but also offers diagnostic leverage for analyzing and fortifying refusal circuits in safety-aligned LLMs. The central insight—that causal circuit tracing outperforms activation-based selection—has significant implications for mechanistic alignment and next-generation interpretability toolchains.
