- The paper introduces InterPLM, a framework using Sparse Autoencoders to uncover interpretable biological features within Protein Language Models like ESM-2, identifying over 2,500 features related to concepts such as binding sites.
- The research shows that PLMs store information in superposition: SAEs reveal vastly more biological concepts per layer than analysis of individual neurons (over 2,500 vs. ~46).
- These interpretable features offer practical applications for annotating protein databases and guiding protein sequence design, accessible via the InterPLM.ai interactive platform.
Analyzing InterPLM: Interpretability in Protein Language Models through Sparse Autoencoders
Protein language models (PLMs) represent a burgeoning domain in computational biology, offering powerful tools for predicting protein structure and function. Despite their remarkable predictive capabilities, the internal mechanisms of PLMs remain largely opaque. The paper "InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders," by Elana Simon and James Zou, addresses this gap by introducing a methodological framework that uses Sparse Autoencoders (SAEs) to make PLMs more interpretable. This essay provides a concise examination of the work's contributions and future implications.
The paper targets ESM-2, a widely used PLM, and elucidates its latent representations by using SAEs to extract more than 2,500 human-interpretable features per model layer. These features align with numerous biological concepts, including binding sites and structural motifs. This is far more than neuron-level analysis uncovers: only about 46 neurons per layer correspond to known biological concepts, which suggests that ESM-2 represents concepts in superposition rather than mapping them neatly onto individual neurons.
Methodology
The authors train SAEs to transform the latent neuron activations of ESM-2 into sparse representations, revealing the compositional elements of its internal activity. They validate the interpretability of the resulting features by evaluating them against biological annotations from Swiss-Prot. Additionally, they build an automated pipeline that uses large language models to assign functional descriptions to newly discovered features, which could help identify novel protein motifs or fill gaps in protein databases.
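To make the setup concrete, here is a minimal sketch of this pipeline in PyTorch: per-residue activations are collected from one ESM-2 layer with the fair-esm package and fed to a ReLU autoencoder trained with an L1 sparsity penalty. The checkpoint, layer index, dictionary size, and loss coefficient are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: per-residue ESM-2 activations -> sparse autoencoder features.
# Checkpoint, layer index, dictionary size, and L1 coefficient are illustrative
# assumptions rather than the paper's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F
import esm  # fair-esm package

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> feature space
        self.decoder = nn.Linear(d_dict, d_model)  # features -> reconstructed activations

    def forward(self, x):
        feats = F.relu(self.encoder(x))            # non-negative, mostly-zero feature activations
        return self.decoder(feats), feats

def sae_loss(x, x_hat, feats, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return F.mse_loss(x_hat, x) + l1_coeff * feats.abs().mean()

# Collect hidden states from one layer of the 8M-parameter ESM-2 model (d_model = 320).
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()
_, _, tokens = batch_converter([("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
with torch.no_grad():
    reps = model(tokens, repr_layers=[3])["representations"][3]
acts = reps[0, 1:-1]                               # drop BOS/EOS tokens -> (seq_len, 320)

# One training step of the SAE on these activations.
sae = SparseAutoencoder(d_model=320, d_dict=2560)  # expansion factor is an assumption
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x_hat, feats = sae(acts)
loss = sae_loss(acts, x_hat, feats)
loss.backward()
opt.step()
```

In practice the activations would be pooled across many proteins and many batches; the single sequence above only shows the shape of the data flowing through the SAE.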
To demonstrate practical applications, the researchers show that the features uncovered by SAEs can guide the annotation of protein databases and support the targeted design of new protein sequences. The project culminates in InterPLM, an interactive platform available at interPLM.ai for exploring and visualizing the learned features, accompanied by code released on GitHub.
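One simple way to picture how a feature is scored against Swiss-Prot and used for annotation is per-residue precision, recall, and F1 against a concept mask, as in the sketch below. The binarization threshold and scoring procedure here are simplifying assumptions, not the paper's exact evaluation protocol.

```python
# Illustrative sketch of scoring one SAE feature against a Swiss-Prot concept
# (e.g. "zinc binding site") by comparing per-residue activations to labels.
# The threshold and pairing procedure are simplifying assumptions.
import numpy as np

def feature_concept_f1(feature_acts: np.ndarray,
                       concept_labels: np.ndarray,
                       threshold: float = 0.5) -> float:
    """feature_acts: activation of one SAE feature at each residue (scaled to 0-1).
    concept_labels: 1 where Swiss-Prot annotates the concept at that residue, else 0."""
    predicted = feature_acts > threshold
    tp = np.sum(predicted & (concept_labels == 1))
    fp = np.sum(predicted & (concept_labels == 0))
    fn = np.sum(~predicted & (concept_labels == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy usage: a feature that fires exactly on the three annotated residues.
acts = np.array([0.9, 0.1, 0.8, 0.0, 0.7])
labels = np.array([1, 0, 1, 0, 1])
print(feature_concept_f1(acts, labels))  # 1.0 for this toy example
```

A feature that scores well on annotated proteins can then be used to flag unannotated proteins where it also fires, which is the intuition behind using features to fill gaps in databases.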
Results and Implications
The findings show that SAEs dramatically increase the interpretability of PLMs by transforming neuron activations into distinct features that align more directly with known protein biology. The gap between features and neurons underscores the tendency of PLMs to store information in superposition, which makes advanced interpretability tools necessary for unpacking their internal representations.
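The superposition point has a simple geometric intuition: a d-dimensional activation space can hold far more than d directions that are nearly orthogonal to one another, so a layer can encode many more concepts than it has neurons. The toy check below is purely illustrative (random directions, not ESM-2 activations), but it shows the effect at the scale discussed here.

```python
# Toy illustration of superposition: many near-orthogonal directions in few dimensions.
import torch
import torch.nn.functional as F

d, n = 320, 2500                      # neuron dimension vs. number of candidate feature directions
dirs = F.normalize(torch.randn(n, d), dim=1)
cos = dirs @ dirs.T                   # pairwise cosine similarities
off_diag = cos[~torch.eye(n, dtype=torch.bool)]
print(off_diag.abs().mean().item())   # roughly 0.045: typical pairs are nearly orthogonal
print(off_diag.abs().max().item())    # worst-case overlap still well below 1
```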
Beyond confirming biological plausibility, these interpretable features can serve as crucial instruments for hypothesis generation in structural biology, supporting both the discovery of new structural motifs and the steering of sequence generation. The proposed framework offers a template for extending feature extraction methodologies to other models or domains and prompts a reconsideration of the balance between interpretability and model complexity.
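As a concrete picture of what steering sequence generation could look like, the sketch below nudges the hidden states of one ESM-2 layer along a chosen feature's decoder direction via a forward hook. This is a generic activation-steering recipe; the layer index, feature index, and scale are hypothetical, and the paper's exact intervention may differ.

```python
# Hedged sketch of feature steering via a forward hook on one ESM-2 layer.
# Layer index, feature index, and scale are hypothetical assumptions.
import torch

def make_steering_hook(decoder_weight: torch.Tensor, feature_idx: int, scale: float):
    # Column feature_idx of the SAE decoder weight is that feature's direction in model space.
    direction = decoder_weight[:, feature_idx].detach()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction          # push activations along the feature
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return hook

# Usage (assuming `model` is a loaded ESM-2 model and `sae` a trained autoencoder as above):
# handle = model.layers[3].register_forward_hook(
#     make_steering_hook(sae.decoder.weight.data, feature_idx=1503, scale=4.0))
# ... run masked-token prediction to propose steered sequences, then handle.remove()
```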
Future Directions
This research opens multiple avenues for further exploration. As PLMs continue to evolve, applying similar interpretability frameworks to more complex models like AlphaFold could further our understanding of protein folding and dynamics. Moreover, deploying these techniques across larger datasets may refine our understanding of protein evolution and function beyond the current annotations in databases like Swiss-Prot.
The paper also suggests opportunities to improve model development by tracking how features are acquired over the course of training, which could inform architectures that better capture relevant biological patterns. Finally, the insights from this work could pave the way for novel interventions in synthetic biology and protein engineering, allowing more controlled and targeted modification of protein sequences based on interpreted model features.
In summary, the paper by Simon and Zou represents a significant step towards making PLMs more transparent and actionable tools in biological research, blending interpretability with computational power in a way that supports both theoretical understanding and practical application in protein science.