Existence and form of a protein language grammar for pLMs

Ascertain whether general biological rules analogous to a "protein language grammar" exist within protein sequences and, if they do, characterize the form such a grammar takes. Additionally, identify which combinations of explainable artificial intelligence (XAI) methods and information sources (training sequences, input prompts, model components, and output sequence perturbations) are required to extract these rules from the decoder-only Transformer-based protein language models (pLMs) used for protein design.

Background

The paper defines five roles for explainability in protein research (Evaluator, Multitasker, Engineer, Coach, and Teacher) and argues that the Teacher role would be the most impactful for the life sciences, as it would recover general biological rules (a protein "grammar") directly from pLMs.

While attention analyses and other XAI techniques have been used largely for evaluation and prediction tasks, the authors note that extracting a protein grammar remains a challenge. They highlight uncertainty both about whether such a grammar exists and, if it does, how to uncover it through specific combinations of XAI methods and information categories within the pLM workflow.
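
As a concrete illustration of one such combination, pairing the "model components" information source with an attention-based XAI method, the sketch below pulls per-head attention maps out of a decoder-only pLM. It is a minimal sketch, assuming the Hugging Face transformers and PyTorch libraries and the publicly available ProtGPT2 checkpoint (nferruz/ProtGPT2); the example sequence and the attention aggregation are illustrative choices, not a method described in the paper.

```python
# Minimal sketch: inspecting attention in a decoder-only pLM via Hugging Face
# transformers. The checkpoint name and example sequence are assumptions made
# for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nferruz/ProtGPT2"  # assumed decoder-only protein language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of length n_layers, each tensor of shape
# (batch, n_heads, seq_len, seq_len), lower-triangular due to causal masking.
attentions = torch.stack(outputs.attentions)  # (n_layers, batch, n_heads, L, L)
print("layers, batch, heads, L, L:", tuple(attentions.shape))

# One simple view of the model components: how much attention each token
# receives, averaged over layers and heads (a common starting point for
# attention analyses, not the paper's own procedure).
received = attentions.mean(dim=(0, 1, 2)).sum(dim=0)  # shape (L,)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, score in zip(tokens, received.tolist()):
    print(f"{tok}\t{score:.3f}")
```

Such per-token attention summaries address only one information category; the open question is which combination of categories and XAI methods, if any, would recover grammar-like rules rather than task-specific evaluation signals.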

References

"However, because it is unclear if and in what form a protein language grammar exists, the combination of method and information category to enable this role remains unclear."

Toward the Explainability of Protein Language Models for Sequence Design (Hunklinger et al., 24 Jun 2025, arXiv:2506.19532), in "Potential roles for XAI methods in protein design", Teacher role paragraph.