
Interpret the Internal States of Recommendation Model with Sparse Autoencoder

Published 9 Nov 2024 in cs.IR | (2411.06112v1)

Abstract: Explainable recommendation systems are important to enhance transparency, accuracy, and fairness. Beyond result-level explanations, model-level interpretations can provide valuable insights that allow developers to optimize system designs and implement targeted improvements. However, most current approaches depend on specialized model designs, which often lack generalization capabilities. Given the various kinds of recommendation models, existing methods have limited ability to effectively interpret them. To address this issue, we propose RecSAE, an automatic, generalizable probing method for interpreting the internal states of Recommendation models with Sparse AutoEncoder. RecSAE serves as a plug-in module that does not affect original models during interpretations, while also enabling predictable modifications to their behaviors based on interpretation results. Firstly, we train an autoencoder with sparsity constraints to reconstruct internal activations of recommendation models, making the RecSAE latents more interpretable and monosemantic than the original neuron activations. Secondly, we automated the construction of concept dictionaries based on the relationship between latent activations and input item sequences. Thirdly, RecSAE validates these interpretations by predicting latent activations on new item sequences using the concept dictionary and deriving interpretation confidence scores from precision and recall. We demonstrate RecSAE's effectiveness on two datasets, identifying hundreds of highly interpretable concepts from pure ID-based models. Latent ablation studies further confirm that manipulating latent concepts produces corresponding changes in model output behavior, underscoring RecSAE's utility for both understanding and targeted tuning recommendation models. Code and data are publicly available at https://github.com/Alice1998/RecSAE.

Summary

  • The paper introduces RecSAE, a plug-in framework that uses a sparse autoencoder (SAE) to interpret the internal states of recommendation models without changing their original function.
  • RecSAE employs a sparse autoencoder with Top-K activation to extract distinct latent features from model activations, and uses a large language model to automatically identify and describe these concepts.
  • Experimental results demonstrate high reconstruction accuracy (NMSE < 0.055), minimal impact on recommendation performance (<1.25% change), and the ability to reveal and manipulate interpretable concepts that control model output.

The paper proposes a generalizable, plug‐in framework that leverages a sparse autoencoder (SAE) to probe and interpret the internal states of recommendation models without modifying their original functionality. The method, termed RecSAE, focuses on extracting monosemantic latent representations from the hidden activations of recommendation architectures (e.g., transformer-based models such as SASRec) by enforcing sparsity via a Top‑K activation mechanism.

The main contributions of the work are summarized as follows:

  • SAE Architecture and Sparsity Constraints
    • The framework injects an SAE module into pre-trained recommendation models at strategic points (e.g., residual streams or user embedding layers) to capture the hidden state activations.
    • A large latent space is constructed by scaling the original activation dimension (typically by a factor of 32), and the encoder output is sparsified through a Top‑K activation layer. This choice enforces a highly sparse activation pattern, favoring interpretability.
    • An auxiliary loss term is introduced to address the “dead” latent problem by guiding inactive neurons (those that are consistently zero over an epoch) to update, ensuring a more uniform utilization of the latent capacity.
    • The reconstruction loss, defined as $\mathcal{L} = \|x - \hat{x}\|_2^2$, is combined linearly with the auxiliary loss to optimize the entire network.
    • Here, $x$ denotes the original model activation and $\hat{x}$ its reconstruction from the sparse latent representation.
  • Automated Concept Dictionary Construction and Verification
    • After training, thousands of sparse latent units are extracted from the model activations. However, only a subset corresponds to monosemantic (i.e., single-concept) features.
    • To automatically map these latent units to human-understandable concepts, the activation values are discretized into 10 intensity levels, and for each latent the framework selects the most strongly activating item sequences from the test set.
    • An LLM (specifically Llama-3-8B-instruct) is then employed to generate succinct textual descriptions for each latent based on these activation sequences.
    • A closed-loop verification procedure is incorporated: by using the constructed concept dictionary to predict activation levels on new item sequences, the method computes precision, recall, and a consolidated confidence score. This score reflects the consistency between the interpreted concept and the actual activations.
  • Experimental Validation and Ablation Studies
    • Experiments are conducted on two datasets: Amazon Grocery and Gourmet Food and MovieLens 1M.
    • The SAE module achieves a normalized mean-squared error below 0.055 and maintains dead latent ratios under 5%, indicating that the vast majority of the expanded latent space is effectively utilized.
    • Downstream recommendation metrics (e.g., HR and NDCG at various cutoffs) show changes of less than 1.25% compared to the baseline models, demonstrating that the SAE reconstruction faithfully preserves the necessary information for prediction.
    • A set of interpretable latent concepts is presented; for instance, a latent interpreted as detecting “Saffron” in the Amazon dataset controls the recommendation ranking: 73% of outputs include saffron-related items in the top 10 when the latent is amplified, versus 12% when it is not. A similar ablation on concepts such as “Gluten-free” further confirms that modulating specific latent activations produces predictable shifts in model output.
  • Implications and Future Directions
    • The paper shows that even pure ID-based recommendation models, trained solely on interaction logs without side semantic information, can encode latent concepts that are both meaningful and interpretable.
    • The plug‑in nature of RecSAE facilitates post hoc interpretability analysis without requiring the model to be trained from scratch with built-in interpretability constraints.
    • Future work is suggested to extend the framework to other types of recommendation scenarios (e.g., context-aware or content-based) and to improve the automated interpretation process via enhanced language modeling techniques and advanced dictionary learning methods.
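As a concrete illustration, the Top-K SAE forward pass and reconstruction loss described above can be sketched as follows. This is a minimal sketch: the dimensions, the value of `k`, and the randomly initialized weights are illustrative assumptions, not the paper's trained configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64             # hidden size of the recommendation model (illustrative)
d_latent = 32 * d_model  # latent space scaled by a factor of 32, as in the paper
k = 16                   # number of latents kept active per input (assumed value)

# Randomly initialized weights stand in for trained SAE parameters.
W_enc = rng.normal(scale=0.02, size=(d_latent, d_model))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.02, size=(d_model, d_latent))
b_dec = np.zeros(d_model)

def topk_sae_forward(x):
    """Encode an activation vector, keep only the Top-K latents, and reconstruct."""
    z = W_enc @ x + b_enc
    keep = np.argsort(z)[-k:]                  # indices of the K largest pre-activations
    z_sparse = np.zeros_like(z)
    z_sparse[keep] = np.maximum(z[keep], 0.0)  # ReLU on the surviving latents
    x_hat = W_dec @ z_sparse + b_dec
    return z_sparse, x_hat

x = rng.normal(size=d_model)           # a stand-in hidden activation
z_sparse, x_hat = topk_sae_forward(x)
recon_loss = np.sum((x - x_hat) ** 2)  # L = ||x - x_hat||_2^2
nmse = recon_loss / np.sum(x ** 2)     # normalized MSE, the metric reported in the paper
```

Enforcing sparsity structurally through Top-K, rather than with an L1 penalty, guarantees that at most `k` latents fire per input, which is what makes individual latents easier to interpret as single concepts.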
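The closed-loop verification step can likewise be sketched. The paper states that a confidence score is derived from precision and recall without specifying the exact combination; the F1-style harmonic mean below is an assumption, and the sequence-id sets are hypothetical.

```python
def concept_confidence(predicted_active, observed_active):
    """Compare dictionary-predicted vs. observed latent activations.

    predicted_active / observed_active: sets of sequence ids on which the
    concept dictionary predicts, respectively the latent actually shows,
    activation. The F1-style combination of precision and recall into a
    single confidence score is an assumption about the paper's formula.
    """
    tp = len(predicted_active & observed_active)
    precision = tp / len(predicted_active) if predicted_active else 0.0
    recall = tp / len(observed_active) if observed_active else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    confidence = 2 * precision * recall / (precision + recall)
    return precision, recall, confidence

# Hypothetical example: a latent predicted active on 4 held-out sequences,
# actually active on 5, with 3 sequences in common.
p, r, conf = concept_confidence({1, 2, 3, 4}, {2, 3, 4, 5, 6})
# precision 0.75, recall 0.6, confidence ≈ 0.667
```

A high confidence score means the textual description generated by the LLM reliably predicts when the latent fires on unseen sequences, which is what justifies keeping that latent in the concept dictionary.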

The methodology offers a systematic way not only to elucidate the decision-making processes of recommendation models but also to introduce controlled modifications that guide recommendation behavior. Such an approach is particularly valuable for improving transparency, diagnosing model biases, and enabling future developments in controllable recommendation systems.
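As a minimal sketch of such a controlled modification, the following shows how a single interpreted latent could be amplified or ablated before decoding, in the spirit of the paper's latent ablation study. The toy sizes, random decoder weights, and the `steer` helper are hypothetical, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent = 8, 32  # toy sizes for illustration
W_dec = rng.normal(scale=0.1, size=(d_model, d_latent))

def steer(z_sparse, latent_id, scale):
    """Rescale one interpreted latent, then decode back to model space.

    The steered reconstruction replaces the original activation inside the
    recommendation model, shifting its ranking toward (scale > 1) or away
    from (scale = 0) the concept that latent encodes, e.g. the "Saffron"
    latent in the paper's case study.
    """
    z = z_sparse.copy()
    z[latent_id] *= scale
    return W_dec @ z  # modified activation fed back into the model

z = np.zeros(d_latent)
z[[3, 7, 11]] = [0.5, 1.2, 0.8]              # a sparse latent code
boosted = steer(z, latent_id=7, scale=5.0)   # amplify the concept latent
ablated = steer(z, latent_id=7, scale=0.0)   # remove it entirely
```

Because the SAE decoder is linear, scaling one latent changes the reconstructed activation by a predictable direction in model space, which is why these interventions produce the systematic ranking shifts the paper reports.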
