I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Published 24 Mar 2025 in cs.CL (arXiv:2503.18878v2)

Abstract: Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We observe reasoning LLMs consistently use vocabulary associated with human reasoning processes. We hypothesize these words correspond to specific reasoning moments within the models' internal mechanisms. To test this hypothesis, we employ Sparse Autoencoders (SAEs), a technique for sparse decomposition of neural network activations into human-interpretable features. We introduce ReasonScore, an automatic metric to identify active SAE features during these reasoning moments. We perform manual and automatic interpretation of the features detected by our metric, and find those with activation patterns matching uncertainty, exploratory thinking, and reflection. Through steering experiments, we demonstrate that amplifying these features increases performance on reasoning-intensive benchmarks (+2.2%) while producing longer reasoning traces (+20.5%). Using the model diffing technique, we provide evidence that these features are present only in models with reasoning capabilities. Our work provides the first step towards a mechanistic understanding of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning

Summary

  • The paper demonstrates that sparse autoencoders can decompose LLM activations into human-interpretable reasoning features.
  • It introduces ReasonScore, a novel metric with an entropy penalty, to quantify the link between reasoning-related vocabulary and model activations.
  • Steering experiments show that amplifying the identified reasoning features improves benchmark performance by up to 13.4% (on AIME-2024).

Interpretability of Reasoning Features in LLMs via Sparse Autoencoders

Introduction

The paper "I Have Covered All the Bases Here: Interpreting Reasoning Features in LLMs via Sparse Autoencoders" (2503.18878) tackles the challenge of unraveling the internal reasoning processes of LLMs. Despite the remarkable advances in LLM capabilities, particularly in structured reasoning and problem-solving, the mechanistic underpinnings of these processes remain largely opaque. The authors employ Sparse Autoencoders (SAEs) to disentangle LLM activations into human-interpretable features and introduce a novel metric, ReasonScore, to identify features active during reasoning. Their work seeks to bridge the gap between LLM output and the internal processes facilitating complex reasoning.

Methodology

Sparse Autoencoders for Feature Decomposition

SAEs serve as the cornerstone of the research, enabling the decomposition of LLM activations into sparse, interpretable components. This approach presumes that reasoning processes can be mapped onto specific activation patterns within the model. The authors leverage SAEs to make sense of these activation patterns, hypothesizing that words commonly associated with human reasoning, such as "perhaps" or "alternatively", correspond to salient features in the model's activation space.
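To make the decomposition concrete, here is a minimal sketch of a generic single-layer ReLU sparse autoencoder with an L1 sparsity penalty; the paper's exact architecture, training objective, and hook point are not assumed here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decomposes residual-stream activations into sparse, non-negative features.
    Generic single-layer ReLU SAE; the paper's exact variant may differ."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)       # reconstruction of the original activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that drives most features to zero
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()
```

Each learned feature then corresponds to one decoder column, and its activation pattern over a corpus is what the interpretation pipeline inspects.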

Development and Use of ReasonScore

The introduction of ReasonScore represents a pivotal advancement in quantifying the activity of reasoning-related features. This metric evaluates the degree to which specific SAE features correlate with a predefined vocabulary of reasoning words. ReasonScore incorporates an entropy penalty to ensure that activated features are not only frequent but also diverse across contexts, enhancing interpretability.
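The paper's exact formula is not reproduced here; the sketch below only illustrates the idea as described above, scoring a feature by how selectively it fires on a predefined reasoning vocabulary and weighting that by an entropy term that rewards spread across many reasoning words rather than a single token (variable names and the specific combination are illustrative assumptions):

```python
import numpy as np

def reason_score(acts, token_ids, reasoning_ids, eps=1e-9):
    """Illustrative ReasonScore-style metric for a single SAE feature.

    acts          -- the feature's activation at each token position, shape (n_tokens,)
    token_ids     -- token id at each position, shape (n_tokens,)
    reasoning_ids -- ids of reasoning-vocabulary tokens ("wait", "perhaps", ...);
                     assumed to contain at least two entries
    """
    reasoning_ids = list(reasoning_ids)
    mask = np.isin(token_ids, reasoning_ids)

    # Selectivity: how much more the feature fires on reasoning words than elsewhere.
    selectivity = acts[mask].mean() - acts[~mask].mean()

    # Entropy term: reward activation mass spread over many reasoning words
    # instead of being concentrated on a single token type.
    per_word = np.array([acts[token_ids == t].sum() for t in reasoning_ids])
    p = per_word / (per_word.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum() / np.log(len(reasoning_ids))

    return selectivity * entropy
```

Features are then ranked by this score, and the top-ranked ones are passed to the interpretation experiments described next.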

Experimental Validation and Steering

To validate their approach, the authors conduct both manual and automatic interpretation experiments on the identified features. Steering experiments further show that amplifying reasoning features improves performance on reasoning-intensive benchmarks. Specifically, they report gains of up to 13.4% on the AIME-2024 benchmark together with longer reasoning traces, indicating that the identified features are causally linked to reasoning behavior.

Interpretation and Evaluation

Manual and Automatic Interpretation

The research employs a rigorous evaluation regime involving both manual and automatic interpretation of SAE features. Manual interpretation focuses on identifying activation patterns corresponding to uncertainty, exploration, and reflection, utilizing feature interfaces for detailed examination. Automatic interpretation employs feature steering and GPT-4o to annotate features with semantic functions, clustering them into categories such as numerical accuracy and reasoning depth (Figure 1).

Figure 1: Interpretability results for the manually verified set of SAE features: (a) examples of feature interfaces used in the manual interpretation experiments; (b) distribution of reasoning features over function groups, obtained by the automatic interpretation pipeline using GPT-4o as a judge.
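To illustrate the LLM-as-judge annotation step described above, here is a hedged sketch of a single judging call; the prompt wording, category names, and helper function are assumptions for illustration, not the paper's actual pipeline:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and configured

client = OpenAI()

JUDGE_PROMPT = (
    "You are labeling one feature of a sparse autoencoder trained on an LLM.\n"
    "Below are text snippets where the feature activates strongly.\n"
    "Give a one-sentence description of what the feature represents and assign\n"
    "it to a function group (e.g. uncertainty, exploration, reflection,\n"
    "numerical accuracy).\n\n{examples}"
)

def annotate_feature(top_examples: list[str]) -> str:
    """Hypothetical helper: ask GPT-4o (as a judge) to label one SAE feature
    given snippets around its highest activations."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(examples="\n---\n".join(top_examples)),
        }],
    )
    return response.choices[0].message.content
```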

Steering Experiments

The steering experiments provide compelling evidence of the practical implications of their findings. By modulating the activation of certain features, the LLM can be guided to produce outputs with deeper reasoning and improved performance on specific tasks. Such experiments showcase the potential for targeted interventions in LLM activations to enhance or alter reasoning processes.
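A hedged sketch of how such feature steering is commonly implemented with an SAE: add a scaled copy of the feature's decoder direction to the residual-stream activations at a chosen layer via a forward hook. The hook point, layer index, and steering scale below are illustrative assumptions, not the paper's exact settings:

```python
import torch

def make_steering_hook(sae_decoder_weight: torch.Tensor, feature_idx: int, scale: float = 4.0):
    """Returns a forward hook that amplifies one SAE feature direction.

    sae_decoder_weight -- decoder weight matrix, shape (d_model, d_features)
    feature_idx        -- index of the reasoning feature to amplify
    scale              -- steering strength (illustrative value)
    """
    direction = sae_decoder_weight[:, feature_idx]
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Add the feature direction at every token position
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Usage (assuming a HuggingFace-style decoder model and the SAE sketched earlier):
# layer = model.model.layers[19]  # hypothetical hook point
# handle = layer.register_forward_hook(
#     make_steering_hook(sae.decoder.weight, feature_idx=1234))
# ... run generation with steering active ...
# handle.remove()
```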

Theoretical and Practical Implications

This study offers significant theoretical insights into the interpretability of reasoning processes in LLMs. By demonstrating that reasoning capabilities can be traced back to specific, interpretable components within the model's activations, the authors provide a concrete framework for understanding the internal mechanics of these complex systems. Practically, the ability to steer LLM reasoning processes opens avenues for fine-tuning models to fit specific reasoning tasks, potentially improving the efficiency and applicability of LLMs across various domains.

Conclusion

The research conducted in this paper lays foundational groundwork for interpreting and manipulating reasoning processes in LLMs via Sparse Autoencoders. By isolating features corresponding to key reasoning behaviors, such as uncertainty and exploration, and employing ReasonScore as a metric, the authors offer a novel perspective on LLM internal mechanisms. Future work can expand upon these findings, exploring the transferability of these insights across different model architectures and integrating them into more applied settings to refine AI reasoning capabilities.
