Language Model Circuits Are Sparse in the Neuron Basis

Published 30 Jan 2026 in cs.CL and cs.AI | (2601.22594v1)

Abstract: The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). LLM interpretability research has thus turned to techniques such as \textit{sparse autoencoders} (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as \textit{circuit tracing}. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that \textbf{MLP neurons are as sparse a feature basis as SAEs}. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g.~`map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of LLMs without additional training costs.

Abstract PDF Upgrade to Chat

Summary

The paper shows that MLP activations form a sparse, efficient basis for tracing circuits, outperforming traditional MLP outputs on subject-verb agreement tasks.
The authors employ the advanced RelP attribution method, which reduces representation size while maintaining high faithfulness and completeness metrics.
Their methodology generalizes across paired and unpaired datasets, enabling scalable, transparent, and cost-effective model interpretability.

Review of "LLM Circuits Are Sparse in the Neuron Basis" (2601.22594)

Introduction

The paper "LLM Circuits Are Sparse in the Neuron Basis" introduces a compelling paradigm shift in understanding the interpretability of neural network computations in transformer LLMs. The authors challenge the prevailing assumption that neuron circuits, particularly those in MLP activations, are denser and less interpretable compared to sparse autoencoders (SAEs). They propose an innovative framework that leverages neuron-based representations, arguing that MLP neurons provide a sparse and efficient feature basis for circuit tracing.

The study is grounded in methodologies that challenge previous baselines, employing advanced attribution methods like RelP over traditional techniques such as Integrated Gradients (IG). By focusing on model internal computations at the neuron level, the authors aim to advance the automation of interpretability without incurring additional computational costs.

Methodology

The authors apply their hypothesis to several benchmarks, notably the subject-verb agreement (SVA) benchmark, to empirically substantiate the sparsity and faithfulness of neuron-based representations in tracing model circuits. The proposition is evaluated over multiple dimensions of model representation, including MLP activations, MLP outputs, attention outputs, and residual streams.

A pivotal aspect of the methodology is the deployment of RelP, a novel gradient-based attribution method that demonstrated superior performance compared to IG. This method reduces computational overhead while maintaining high fidelity in attribution accuracy, closing critical gaps previously observed in neuron sparsity compared to SAE features.

Figure 1: Faithfulness and completeness for different choices of representation in the model (residual stream, attention, MLP activations, or MLP outputs) and basis (neurons or SAE) when applying Integrated Gradients, averaged over the 4 SVA tasks with paired data.

The experimental framework also explores representation choices through zero and mean ablation approaches with paired and unpaired dataset configurations, highlighting the robustness of neuron-based circuit tracing across diverse data settings.

Experimental Findings

The research presents four core results:

MLP Activations as a Sparse Basis: Empirical evaluation on SVA tasks reveals that circuits derived from MLP activations are noticeably smaller yet maintain robustness in faithfulness and completeness metrics, surpassing traditional MLP outputs.
Effectiveness of RelP: The RelP method systematically decreases the representation size needed for neuron-based circuits, offering significant improvements over IG across the investigated tasks (Figure 2).
Figure 2: Faithfulness and completeness for Integrated Gradients vs.~RelP, for different choices of representation in the model and basis (neurons or SAE), averaged over the 4 SVA tasks with paired data.
Generalizability to Unpaired Data: The proposed methods maintain performance advantages in unpaired data scenarios, underscoring their applicability to real-world datasets where counterfactual examples are unavailable (Figure 3).
Figure 3: Faithfulness and completeness for Integrated Gradients vs.~RelP, for different choices of representation in the model and basis (neurons or SAE), averaged over the 4 SVA tasks with unpaired data.
Comprehensive Edge Analysis: The study extends its scope to edge-based analysis, proving that neuron connections evaluated through RelP maintain high faithfulness, optimizing both node and edge gravity in circuits.

Case Studies and Implications

Significantly, the work explores case studies like multi-hop reasoning for state capitals, illustrating how neuron activation patterns can be deciphered to understand complex reasoning processes within models. This supports the potential for manipulating or steering targeted model behaviors effectively, opening pathways for practical oversight and dynamic model interaction.

Figure 4: Schematic of 23 key neurons in the 'Texas' circuit, their interactions, and the effects of steering groups of them.

The implications extend beyond interpretability; the proposed neuron-based approach facilitates a scalable, efficient way to probe deep learning models without re-training or additional overheads. This not only democratizes model debugging and oversight but also provides a foundation for developing AI systems that can be fine-tuned or adjusted post-deployment based on specific task requirements.

Conclusion

This paper makes significant strides in computational interpretability within AI, proposing a tractable, neuron-centric framework that balances the granularity and accessibility of feature space analysis in transformers. By advocating for the utility of MLP activations over conventional SAE structures, the authors pave the way for more transparent and controllable AI systems, potentially influencing future model training and interpretability research paradigms.

Markdown