Measuring Feature Sparsity in Language Models (2310.07837v2)
Abstract: Recent works have proposed that activations in LLMs can be modelled as sparse linear combinations of vectors corresponding to features of the input text. Under this assumption, these works aimed to reconstruct the feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and to test the validity of the linearity and sparsity assumptions. We show that our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish sparse linear data from several other distributions. We use our metrics to measure the level of sparsity in several LLMs. We find evidence that LLM activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.
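The setup the abstract describes can be made concrete with a small sketch: generate synthetic activations as sparse linear combinations of ground-truth feature directions, then recover sparse codes with a standard sparse coding solver. The sketch below is illustrative only, not the paper's method or metrics; all sizes (`d`, `n_feats`, `k`) and the choice of ISTA (iterative soft-thresholding) as the solver are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feats, n_samples, k = 64, 256, 500, 5  # hypothetical sizes

# Ground-truth overcomplete dictionary of unit-norm feature directions.
D = rng.normal(size=(n_feats, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Each synthetic "activation" is a sparse linear combination of k features.
codes = np.zeros((n_samples, n_feats))
for i in range(n_samples):
    active = rng.choice(n_feats, size=k, replace=False)
    codes[i, active] = rng.uniform(0.5, 1.5, size=k)
X = codes @ D

# Recover sparse codes with ISTA, minimising
#   0.5 * ||X - Z @ D||^2 + lam * ||Z||_1   over Z.
lam = 0.05
eta = 1.0 / np.linalg.norm(D @ D.T, 2)  # step size from the Lipschitz constant
Z = np.zeros_like(codes)
for _ in range(300):
    Z = Z - eta * (Z @ D - X) @ D.T                           # gradient step
    Z = np.sign(Z) * np.maximum(np.abs(Z) - eta * lam, 0.0)   # soft threshold

rel_err = np.linalg.norm(Z @ D - X) / np.linalg.norm(X)
print(f"true avg active features per sample: {(codes != 0).sum(axis=1).mean()}")
print(f"relative reconstruction error: {rel_err:.3f}")
```

In the noise-free case a small L1 penalty suffices; on real LLM activations the dictionary itself is unknown and must be learned jointly (e.g. via dictionary learning or sparse autoencoders), which is the regime the paper's metrics are designed to evaluate.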