- The paper introduces output-centric techniques, VocabProj and TokenChange, that describe features by their effect on LLM outputs at a fraction of the cost of activation-based pipelines.
- It shows that output-centric descriptions capture a feature's causal effect on model outputs better than traditional input-centric descriptions do.
- Combining input- and output-centric approaches yields the best descriptions overall and even recovers "dead" features that never activate on sampled inputs yet still affect model outputs.
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
The paper addresses a significant challenge in natural language processing: interpreting the internal representations of LLMs. These models contain vast numbers of features (neurons, sparse autoencoder latents, or other atomic units) whose roles in text generation are not obvious. Traditional automated interpretability pipelines generate natural language descriptions of these features from an input-centric view, based on the inputs that activate them. This is costly and may not accurately reflect how the features influence model outputs.
Overview of the Problem
LLMs encode concepts as directions, i.e., real-valued vectors, in their hidden representations. Interpreting a feature requires understanding both the inputs that activate it and its effect on the model's outputs. Current input-centric methods collect activating examples by running the model over large datasets, which is resource-intensive, and because they never intervene on the feature, they can miss its causal effect on model behavior, leading to incomplete or misleading descriptions.
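To make the cost concrete, here is a minimal sketch of the input-centric recipe: run the model over a corpus, record how strongly a chosen feature fires on each sentence, and keep the top examples for an explainer to summarize. The model (gpt2), layer, tiny corpus, and the use of a single hidden-state coordinate as the "feature" are placeholder assumptions, not the paper's setup; real pipelines scan vastly larger corpora, which is where the expense comes from.

```python
# Input-centric (MaxAct-style) sketch: find the sentences on which a feature
# fires most strongly. Corpus, layer, and feature index are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, FEATURE = 6, 123                               # one hidden-state coordinate as a stand-in feature
corpus = [
    "Paris is the capital of France.",
    "The cat sat on the mat.",
    "Stock prices fell sharply on Tuesday.",
]

records = []
for text in corpus:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        # hidden_states[LAYER] has shape (batch, seq_len, hidden_size)
        hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER]
    records.append((hidden[0, :, FEATURE].max().item(), text))

# The top-activating sentences are what an explainer LLM would summarize.
for score, text in sorted(records, reverse=True)[:2]:
    print(f"{score:.2f}  {text}")
```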
Proposed Method
To address these issues, the authors propose output-centric methods that focus on the causal effect a feature has on model outputs. These methods are computationally efficient and aim to capture how a feature shapes the model's next-token distribution:
- Vocabulary Projection (VocabProj): Projects the feature vector onto the vocabulary space through the model's unembedding matrix and reads off the tokens the feature most promotes. It requires only a single matrix multiplication, making it highly efficient (see the code sketch below).
- Token Change (TokenChange): Amplifies the feature during inference and records which token probabilities in the model's output change the most. Unlike VocabProj, it involves a small causal intervention in the model rather than a purely static projection.
These methods are evaluated against existing input-centric baselines, particularly MaxAct, which describes a feature by the inputs that most strongly activate it.
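A minimal sketch of both output-centric methods follows, assuming the feature is a direction in the residual stream of a small decoder-only model. The concrete choices (gpt2, layer 6, scale 10, a random stand-in feature direction) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of VocabProj and TokenChange for a feature represented as a direction
# `feature_dir` in the residual stream. All concrete choices here are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

d_model = model.config.hidden_size
feature_dir = torch.randn(d_model)                  # stand-in for a real neuron/SAE direction
feature_dir = feature_dir / feature_dir.norm()

# --- VocabProj: project the feature through the unembedding matrix -----------
W_U = model.get_output_embeddings().weight          # (vocab_size, d_model)
proj = W_U @ feature_dir                            # a single matrix-vector product
print("VocabProj top tokens:",
      [tok.decode([int(i)]) for i in proj.topk(10).indices])

# --- TokenChange: amplify the feature and see which token probabilities move --
LAYER, SCALE = 6, 10.0

def add_feature(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * feature_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

prompt = tok("The weather today is", return_tensors="pt")
with torch.no_grad():
    base = model(**prompt).logits[0, -1].softmax(-1)
    handle = model.transformer.h[LAYER].register_forward_hook(add_feature)
    steered = model(**prompt).logits[0, -1].softmax(-1)
    handle.remove()

delta = steered - base                              # change in next-token probabilities
print("TokenChange most-promoted tokens:",
      [tok.decode([int(i)]) for i in delta.topk(10).indices])
```

In both cases an explainer LLM would turn the resulting token lists into a natural language description; the point of the sketch is that neither method needs a pass over a large corpus.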
Evaluation
The paper presents an evaluation framework using two complementary metrics:
- Input-based Evaluation: Tests how well a description lets an evaluator pick out the inputs that actually activate the feature, following the prevailing evaluation paradigm.
- Output-based Evaluation: Tests how well a description captures the feature's effect on model outputs: the feature is selectively amplified during generation and a judge checks whether the resulting text matches the description (a minimal amplify-and-judge loop is sketched below).
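As a rough illustration of the output-based evaluation, the sketch below steers the model with a hypothetical feature during generation and checks the output against a candidate description. The model, layer, scale, and the keyword-based `judge_matches` helper are all assumptions; in practice an LLM judge, not keyword matching, does the scoring.

```python
# Output-based evaluation sketch: steer the model with the feature and check the
# generated text against a candidate description.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, SCALE = 6, 10.0                                # illustrative hyperparameters
feature_dir = torch.randn(model.config.hidden_size)
feature_dir = feature_dir / feature_dir.norm()
description = "promotes weather-related vocabulary"  # candidate description under test

def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * feature_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

def judge_matches(texts, description):
    # Placeholder judge: a real pipeline would ask an LLM whether the steered
    # generations reflect the described concept.
    return any("weather" in t.lower() for t in texts)

prompt = tok("Tell me about", return_tensors="pt")
handle = model.transformer.h[LAYER].register_forward_hook(steer)
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=20, do_sample=True,
                         pad_token_id=tok.eos_token_id)
handle.remove()

steered_text = tok.decode(out[0], skip_special_tokens=True)
print(steered_text, "->", judge_matches([steered_text], description))
```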
The results indicate that while input-centric methods like MaxAct perform better on the input-based evaluation, they are outperformed by output-centric methods on the output-based evaluation. The researchers find that combining both approaches yields the most comprehensive and faithful feature descriptions.
Empirical Findings
Across multiple LLM architectures and feature types, ensembles that integrate input- and output-centric methods surpass each individual method on both evaluation metrics. This hybrid approach leverages activating examples and causal output analysis together to give a more complete picture of a feature's role, as in the sketch below.
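One way such an ensemble might assemble its evidence is sketched here: activating sentences (input-centric) and VocabProj/TokenChange token lists (output-centric) are folded into a single prompt for an explainer LLM. The `build_prompt` helper and the example evidence are hypothetical; only the idea of pooling both kinds of evidence comes from the paper.

```python
# Ensemble-style description prompt: combine input-centric evidence (activating
# sentences) with output-centric evidence (promoted tokens) before asking an
# explainer LLM for a single description. The explainer call itself is omitted.
def build_prompt(activating_examples, vocabproj_tokens, tokenchange_tokens):
    return (
        "Describe the concept this feature represents.\n"
        "Sentences on which it activates most strongly:\n"
        + "\n".join(f"- {s}" for s in activating_examples)
        + "\nTokens it promotes in the vocabulary projection: "
        + ", ".join(vocabproj_tokens)
        + "\nTokens whose probability rises when it is amplified: "
        + ", ".join(tokenchange_tokens)
    )

print(build_prompt(
    ["It will rain heavily tonight.", "A cold front moves in tomorrow."],
    ["rain", "storm", "cloudy"],
    ["weather", "forecast", "snow"],
))
```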
Interestingly, output-centric methods also make it possible to identify and describe "dead" features, which do not activate on sampled inputs under normal circumstances but still affect model outputs when amplified.
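The sketch below illustrates the idea schematically: score a batch of candidate feature directions by their maximum activation on a small sample, then describe the least-activating ones from the output side via VocabProj. The random decoder rows, the tiny corpus, and the activation proxy are stand-ins; in a real sparse autoencoder, "dead" latents are those whose activation is zero on every sampled token.

```python
# Schematic sketch: features that barely activate on sampled inputs can still be
# described through their vocabulary-space footprint. Random directions stand in
# for SAE decoder rows; a real pipeline would use the SAE's own activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

d_model = model.config.hidden_size
features = torch.randn(50, d_model)                   # stand-in for SAE decoder rows
W_U = model.get_output_embeddings().weight            # (vocab_size, d_model)

corpus = ["The weather is nice today.", "She bought three apples."]
max_act = torch.full((features.shape[0],), float("-inf"))
for text in corpus:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        resid = model(**inputs, output_hidden_states=True).hidden_states[6][0]
    acts = resid @ features.T                          # (seq_len, n_features) activation proxy
    max_act = torch.maximum(max_act, acts.max(dim=0).values)

# Describe the least-activating ("dead-ish") features from the output side.
for f in max_act.argsort()[:3]:
    top = [tok.decode([int(i)]) for i in (W_U @ features[f]).topk(5).indices]
    print(f"feature {int(f)} (max act {max_act[f].item():.2f}) -> VocabProj tokens: {top}")
```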
Implications and Future Directions
The paper argues for a shift from today's predominantly input-centric interpretability toward a more balanced, or even output-weighted, view. Because output-centric descriptions are grounded in causal effects on outputs, they provide a direct link between feature activations and model behavior, which is vital for practical applications such as model steering or fine-tuning for specific tasks.
Looking forward, these insights could lead researchers to develop hybrid interpretability frameworks that scale efficiently while deepening our understanding of latent spaces in LLMs. As the field progresses, such approaches could become integral to building more transparent and controllable AI systems. Further work on optimizing ensemble methods and on extending the methodology to features that do not project cleanly onto the vocabulary would be beneficial.