
Extracting Rule-based Descriptions of Attention Features in Transformers (2510.18148v1)

Published 20 Oct 2025 in cs.CL and cs.LG

Abstract: Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how it may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1) skip-gram rules of the form "[Canadian city]... speaks --> English", (2) absence rules of the form "[Montreal]... speaks -/-> English," and (3) counting rules that toggle only when the count of a word exceeds a certain value or the count of another word. Absence and counting rules are not readily discovered by inspection of exemplars, where manual and automatic descriptions often identify misleading or incomplete explanations. We then describe a simple approach to extract these types of rules automatically from a transformer, and apply it to GPT-2 small. We find that a majority of features may be described well with around 100 skip-gram rules, though absence rules are abundant even as early as the first layer (in over a fourth of features). We also isolate a few examples of counting rules. This paper lays the groundwork for future research into rule-based descriptions of features by defining them, showing how they may be extracted, and providing a preliminary taxonomy of some of the behaviors they represent.

Summary

  • The paper introduces a novel methodology that uses symbolic, rule-based descriptions to interpret transformer attention features.
  • It employs skip-gram, absence, and counting rules to systematically capture input-output patterns in GPT-2 small.
  • Results indicate that while primitive rules explain early-layer behavior, more complex interactions in later layers demand sophisticated analysis.

Rule-based Descriptions of Attention Features in Transformers

"Extracting Rule-based Descriptions of Attention Features in Transformers" (2510.18148) presents a novel methodology for interpreting the attention mechanisms in transformer LLMs through rule-based descriptions, moving beyond manual inspection of exemplars. This approach focuses on deriving symbolic rules for attention features, offering an inherently interpretable model of how certain patterns of input tokens correspond to specific output behaviors.

Mechanistic Interpretability and Sparse Autoencoders

Mechanistic interpretability aims to explain model behavior in terms of low-level primitives. Transformers, widely adopted for sequence modeling, build their attention layers from operations such as query-key interactions. Traditional approaches characterize a feature by collecting exemplars that activate it strongly, but manual inspection of those exemplars yields subjective interpretations. Sparse autoencoders (SAEs), which decompose hidden states into sparse combinations of feature vectors, have facilitated the discovery of circuits underlying specific model behaviors [bricken2023monosemanticity] (Figure 1).

Figure 1: Given an attention layer in a transformer LLM (left), the task is to express each output feature as an explicit function of input features (center), described globally in terms of formal rules (right).
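As a concrete illustration (not the authors' code), the minimal sketch below shows the SAE setup assumed throughout: hidden states are reconstructed as sparse, non-negative combinations of learned feature directions. The dimensions, class name, and L1 coefficient are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstructs hidden states as sparse combinations of feature vectors."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, h: torch.Tensor):
        # Feature activations: non-negative, pushed toward sparsity by an L1 penalty.
        f = torch.relu(self.encoder(h))
        h_hat = self.decoder(f)
        return h_hat, f

def sae_loss(h, h_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus sparsity penalty (illustrative coefficient).
    return ((h - h_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```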

Rule-based Framework for Attention Layers

The paper proposes three rule types (skip-gram, absence, and counting rules) to describe attention-layer computations. Skip-gram rules (e.g., "[Canadian city] … speaks → English") promote output tokens when a pattern of input tokens is present; absence rules (e.g., "[Montreal] … speaks ⊬ English") suppress an otherwise expected output token when a distractor is present; and counting rules fire only when the count of one token exceeds a threshold or the count of another token. Extracting such rules replaces exemplar-based, subjective interpretation with symbolic, human-readable descriptions (Figure 2).

Figure 2: Attention rules can take different forms based on input feature interactions, leading to skip-gram rules (left), absence rules (center), and counting rules (right).
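To make the three rule types concrete, the sketch below represents each as a small matcher over token sequences. The class names, fields, and matching logic are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class SkipGramRule:
    # Fires when all trigger tokens appear in order, with gaps allowed.
    triggers: list      # e.g. ["Toronto", "speaks"]
    promotes: str       # e.g. "English"

    def matches(self, tokens: list) -> bool:
        i = 0
        for t in tokens:
            if i < len(self.triggers) and t == self.triggers[i]:
                i += 1
        return i == len(self.triggers)

@dataclass
class AbsenceRule:
    # Suppresses the output when a distractor token co-occurs with the triggers.
    base: SkipGramRule
    distractor: str     # e.g. "Montreal"

    def matches(self, tokens: list) -> bool:
        return self.base.matches(tokens) and self.distractor not in tokens

@dataclass
class CountingRule:
    # Fires only when one token's count exceeds another's (e.g. unbalanced "(" vs ")").
    counted: str
    reference: str

    def matches(self, tokens: list) -> bool:
        return tokens.count(self.counted) > tokens.count(self.reference)
```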

Extraction Methodology and Application to GPT-2 Small

The authors develop a pipeline that extracts these symbolic rules automatically and apply it to GPT-2 small, a widely studied transformer model. Skip-gram rules describe early-layer attention features adequately with concise patterns, while intricate interactions in later layers call for more complex rules such as distractor suppression and counting (Figure 3).

Figure 3: Average precision and recall metrics in predicting feature activations based on extracted rules.
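One natural way to score how well a rule set describes a feature, consistent with the precision and recall framing above, is to treat the rules as a binary predictor of whether the feature fires on a given sequence. The sketch below reuses the hypothetical rule objects from the earlier sketch; the interface and activation threshold are assumptions for illustration, not the paper's evaluation code.

```python
def evaluate_rules(rules, sequences, feature_activations, act_threshold=0.0):
    """Score a rule set as a predictor of whether an SAE feature fires on each sequence."""
    tp = fp = fn = 0
    for tokens, activation in zip(sequences, feature_activations):
        predicted = any(rule.matches(tokens) for rule in rules)
        actual = activation > act_threshold
        tp += predicted and actual
        fp += predicted and not actual
        fn += (not predicted) and actual
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```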

Implications and Future Directions

Rule-based descriptions facilitate a deeper understanding of model behavior, bridging the gap between input stimuli and model decisions. Although simple rules provide a good approximate model of early layers, the complexity of higher layers calls for more sophisticated analyses. Future work could extend this framework by integrating rule-based analysis with more advanced feature decompositions, improving our grasp of model computations across all layers (Figure 4).

Figure 4: Prevalence of inputs with distractor keys, showing how absence rules occur and affect output feature activations.
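The mechanism behind absence rules can be pictured directly in the attention softmax: a distractor key that draws attention away from the trigger dilutes the trigger's value contribution, suppressing the output feature. The toy scores below are illustrative and not taken from the paper.

```python
import torch

def feature_output(scores: torch.Tensor, trigger_value: float) -> float:
    """Output feature activation as the trigger's value weighted by its attention share."""
    weights = torch.softmax(scores, dim=-1)
    return (weights[0] * trigger_value).item()  # index 0 = trigger key

# Without a distractor: the trigger key dominates the softmax.
print(feature_output(torch.tensor([4.0, 0.0]), trigger_value=1.0))        # ~0.98

# With a distractor key scoring as high as the trigger, the trigger's share drops,
# suppressing the output feature -- the behavior an absence rule captures.
print(feature_output(torch.tensor([4.0, 4.0, 0.0]), trigger_value=1.0))   # ~0.50
```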

Counting Rules and Intricate Feature Interactions

The paper also explores attention heads that implement counting rules, where a feature's activation depends on the frequency of specific input features. Such patterns are observed as early as the first layer of GPT-2 small, for example in behavior resembling parenthesis balancing, which illustrates how attention mechanisms can capture hierarchical sequence patterns [yao2021self] (Figure 5).

Figure 5: Sequences that activate an attention feature from head 10 in layer 0 in GPT-2 small indicate a counting rule.
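Because attention weights are normalized, a head that attends roughly uniformly over a set of tokens effectively averages their values, so its output reflects relative counts, which is enough to realize a threshold-style counting rule such as parenthesis balancing. The toy function below illustrates this idea under that assumption; it is not the extracted head from the paper.

```python
def balance_signal(tokens, open_tok="(", close_tok=")"):
    """Toy head: uniform attention over parentheses averages +1/-1 values."""
    values = [1.0 if t == open_tok else -1.0
              for t in tokens if t in (open_tok, close_tok)]
    # Uniform attention weights (1/len) make the output the mean value,
    # which is positive exactly when "(" outnumbers ")".
    return sum(values) / len(values) if values else 0.0

print(balance_signal(list("(()")))   # ~0.33 -> feature fires (more "(" than ")")
print(balance_signal(list("(())")))  # 0.0   -> balanced, feature stays off
```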

Conclusion

The research outlines a paradigm shift in understanding transformer models by formalizing attention features as rule-based entities, paving the way for more rigorous mechanistic interpretability and inherently interpretable descriptions. Handling the greater complexity of higher layers will be key to extending this approach into a comprehensive account of LLM behavior.

This work lays the groundwork for future research, potentially leading to advances in explaining neural network behavior in transparent, human-interpretable terms.
