- The paper introduces output-centric techniques, VocabProj and TokenChange, that describe features by their effect on LLM outputs at a fraction of the cost of activation-based pipelines.
- It shows that output-centric descriptions capture a feature's causal effect on model outputs better than traditional input-centric descriptions do.
- Combining input- and output-centric approaches yields the best descriptions overall and even recovers "dead" features that never activate on sampled inputs yet still affect model outputs.
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
The paper addresses a significant challenge in natural language processing: interpreting the internal representations of LLMs. These models contain vast numbers of features (neurons, sparse autoencoder latents, or other atomic units) whose roles in text generation are not obvious. Traditional automated interpretability pipelines generate natural language descriptions of these features from an input-centric view, based on the inputs that activate them. This is costly and may not accurately reflect how the features influence model outputs.
Overview of the Problem
LLMs encode concepts as directions, i.e., real-valued vectors, in their hidden representations. Interpreting a feature requires understanding both the inputs that activate it and its effect on the model's outputs. Current input-centric methods collect activating examples by running the model over large datasets, which is resource-intensive, and because they never intervene on the feature, they can miss its causal effect on model behavior, leading to incomplete or misleading descriptions.
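To make the cost concrete, here is a minimal sketch of the input-centric recipe: run the model over a corpus, record how strongly a chosen feature fires on each sentence, and keep the top examples for an explainer to summarize. The model (gpt2), layer, tiny corpus, and the use of a single hidden-state coordinate as the "feature" are placeholder assumptions, not the paper's setup; real pipelines scan vastly larger corpora, which is where the expense comes from.

```python
# Input-centric (MaxAct-style) sketch: find the sentences on which a feature
# fires most strongly. Corpus, layer, and feature index are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, FEATURE = 6, 123                               # one hidden-state coordinate as a stand-in feature
corpus = [
    "Paris is the capital of France.",
    "The cat sat on the mat.",
    "Stock prices fell sharply on Tuesday.",
]

records = []
for text in corpus:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        # hidden_states[LAYER] has shape (batch, seq_len, hidden_size)
        hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER]
    records.append((hidden[0, :, FEATURE].max().item(), text))

# The top-activating sentences are what an explainer LLM would summarize.
for score, text in sorted(records, reverse=True)[:2]:
    print(f"{score:.2f}  {text}")
```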
Proposed Method
To address these issues, the authors propose output-centric methods that focus on the causal effect a feature has on model outputs. These methods are computationally efficient and aim to capture how a feature shapes the model's next-token distribution:
- Vocabulary Projection (VocabProj): Projects the feature vector onto the vocabulary space through the model's unembedding matrix and reads off the tokens the feature most promotes. It requires only a single matrix multiplication, making it highly efficient (see the code sketch below).
- Token Change (TokenChange): Amplifies the feature during inference and records which token probabilities in the model's output change the most. Unlike VocabProj, it involves a small causal intervention in the model rather than a purely static projection.
These methods are evaluated against existing input-centric baselines, particularly MaxAct, which describes a feature by the inputs that most strongly activate it.
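A minimal sketch of both output-centric methods follows, assuming the feature is a direction in the residual stream of a small decoder-only model. The concrete choices (gpt2, layer 6, scale 10, a random stand-in feature direction) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of VocabProj and TokenChange for a feature represented as a direction
# `feature_dir` in the residual stream. All concrete choices here are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

d_model = model.config.hidden_size
feature_dir = torch.randn(d_model)                  # stand-in for a real neuron/SAE direction
feature_dir = feature_dir / feature_dir.norm()

# --- VocabProj: project the feature through the unembedding matrix -----------
W_U = model.get_output_embeddings().weight          # (vocab_size, d_model)
proj = W_U @ feature_dir                            # a single matrix-vector product
print("VocabProj top tokens:",
      [tok.decode([int(i)]) for i in proj.topk(10).indices])

# --- TokenChange: amplify the feature and see which token probabilities move --
LAYER, SCALE = 6, 10.0

def add_feature(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * feature_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

prompt = tok("The weather today is", return_tensors="pt")
with torch.no_grad():
    base = model(**prompt).logits[0, -1].softmax(-1)
    handle = model.transformer.h[LAYER].register_forward_hook(add_feature)
    steered = model(**prompt).logits[0, -1].softmax(-1)
    handle.remove()

delta = steered - base                              # change in next-token probabilities
print("TokenChange most-promoted tokens:",
      [tok.decode([int(i)]) for i in delta.topk(10).indices])
```

In both cases an explainer LLM would turn the resulting token lists into a natural language description; the point of the sketch is that neither method needs a pass over a large corpus.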
Evaluation
The paper presents an evaluation framework using two complementary metrics:
- Input-based Evaluation: Tests how well a description lets an evaluator pick out the inputs that actually activate the feature, following the prevailing evaluation paradigm.
- Output-based Evaluation: Tests how well a description captures the feature's effect on model outputs: the feature is selectively amplified during generation and a judge checks whether the resulting text matches the description (a minimal amplify-and-judge loop is sketched below).
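As a rough illustration of the output-based evaluation, the sketch below steers the model with a hypothetical feature during generation and checks the output against a candidate description. The model, layer, scale, and the keyword-based `judge_matches` helper are all assumptions; in practice an LLM judge, not keyword matching, does the scoring.

```python
# Output-based evaluation sketch: steer the model with the feature and check the
# generated text against a candidate description.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, SCALE = 6, 10.0                                # illustrative hyperparameters
feature_dir = torch.randn(model.config.hidden_size)
feature_dir = feature_dir / feature_dir.norm()
description = "promotes weather-related vocabulary"  # candidate description under test

def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * feature_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

def judge_matches(texts, description):
    # Placeholder judge: a real pipeline would ask an LLM whether the steered
    # generations reflect the described concept.
    return any("weather" in t.lower() for t in texts)

prompt = tok("Tell me about", return_tensors="pt")
handle = model.transformer.h[LAYER].register_forward_hook(steer)
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=20, do_sample=True,
                         pad_token_id=tok.eos_token_id)
handle.remove()

steered_text = tok.decode(out[0], skip_special_tokens=True)
print(steered_text, "->", judge_matches([steered_text], description))
```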
The results indicate that while input-centric methods like MaxAct perform better on the input-based evaluation, they are outperformed by output-centric methods on the output-based evaluation. The researchers find that combining both approaches yields the most comprehensive and faithful feature descriptions.
Empirical Findings
Across multiple LLM architectures and feature types, ensembles that integrate input- and output-centric methods surpass each individual method on both evaluation metrics. This hybrid approach leverages activating examples and causal output analysis together to give a more complete picture of a feature's role, as in the sketch below.
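One way such an ensemble might assemble its evidence is sketched here: activating sentences (input-centric) and VocabProj/TokenChange token lists (output-centric) are folded into a single prompt for an explainer LLM. The `build_prompt` helper and the example evidence are hypothetical; only the idea of pooling both kinds of evidence comes from the paper.

```python
# Ensemble-style description prompt: combine input-centric evidence (activating
# sentences) with output-centric evidence (promoted tokens) before asking an
# explainer LLM for a single description. The explainer call itself is omitted.
def build_prompt(activating_examples, vocabproj_tokens, tokenchange_tokens):
    return (
        "Describe the concept this feature represents.\n"
        "Sentences on which it activates most strongly:\n"
        + "\n".join(f"- {s}" for s in activating_examples)
        + "\nTokens it promotes in the vocabulary projection: "
        + ", ".join(vocabproj_tokens)
        + "\nTokens whose probability rises when it is amplified: "
        + ", ".join(tokenchange_tokens)
    )

print(build_prompt(
    ["It will rain heavily tonight.", "A cold front moves in tomorrow."],
    ["rain", "storm", "cloudy"],
    ["weather", "forecast", "snow"],
))
```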
Interestingly, output-centric methods also make it possible to identify and describe "dead" features, which do not activate on sampled inputs under normal circumstances but still affect model outputs when amplified.
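The sketch below illustrates the idea schematically: score a batch of candidate feature directions by their maximum activation on a small sample, then describe the least-activating ones from the output side via VocabProj. The random decoder rows, the tiny corpus, and the activation proxy are stand-ins; in a real sparse autoencoder, "dead" latents are those whose activation is zero on every sampled token.

```python
# Schematic sketch: features that barely activate on sampled inputs can still be
# described through their vocabulary-space footprint. Random directions stand in
# for SAE decoder rows; a real pipeline would use the SAE's own activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

d_model = model.config.hidden_size
features = torch.randn(50, d_model)                   # stand-in for SAE decoder rows
W_U = model.get_output_embeddings().weight            # (vocab_size, d_model)

corpus = ["The weather is nice today.", "She bought three apples."]
max_act = torch.full((features.shape[0],), float("-inf"))
for text in corpus:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        resid = model(**inputs, output_hidden_states=True).hidden_states[6][0]
    acts = resid @ features.T                          # (seq_len, n_features) activation proxy
    max_act = torch.maximum(max_act, acts.max(dim=0).values)

# Describe the least-activating ("dead-ish") features from the output side.
for f in max_act.argsort()[:3]:
    top = [tok.decode([int(i)]) for i in (W_U @ features[f]).topk(5).indices]
    print(f"feature {int(f)} (max act {max_act[f].item():.2f}) -> VocabProj tokens: {top}")
```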
Implications and Future Directions
The paper argues for a shift from today's predominantly input-centric interpretability toward a more balanced, or even output-weighted, view. Because output-centric descriptions are grounded in causal effects on outputs, they provide a direct link between feature activations and model behavior, which is vital for practical applications such as model steering or fine-tuning for specific tasks.
Looking forward, these insights could lead researchers to develop hybrid interpretability frameworks that scale efficiently while deepening our understanding of latent spaces in LLMs. As the field progresses, such approaches could become integral to building more transparent and controllable AI systems. Further work on optimizing ensemble methods and on extending the methodology to features that do not project cleanly onto the vocabulary would be beneficial.