OFS Pseudo-perplexity for Protein Fitness
- The paper introduces OFS pseudo-perplexity, an efficient one-pass approximation for protein sequence fitness evaluation.
- It leverages regression from unmasked residue embeddings to masked token distributions, reducing computational cost compared to traditional methods.
- Empirical results demonstrate high correlation with true pseudo-perplexity and success in applications like sequence design and ancestral protein stability analysis.
One Fell Swoop (OFS) pseudo-perplexity is an efficient approximation of masked language model (MLM) pseudo-perplexity for protein sequence fitness estimation. It uses a regression from unmasked residue embeddings to masked token distributions, quantifying model uncertainty in a single forward pass. OFS pseudo-perplexity retains most of the predictive power of true pseudo-perplexity at a fraction of the computational cost, facilitating high-throughput scoring, rapid sequence sampling, and in-depth analysis of sequence design and stability (Kantroo et al., 2024).
1. Pseudo-perplexity in Masked Language Models
Pseudo-perplexity quantifies a masked language model's uncertainty about a protein sequence. Given a sequence $x = (x_1, \dots, x_L)$ of length $L$ over a vocabulary $\mathcal{V}$ (the 20 amino acids plus special tokens), the model is trained to predict masked residues from the surrounding context. Summing the per-position masked log-probabilities gives the pseudo-log-likelihood

$$\mathrm{PLL}(x) = \sum_{i=1}^{L} \log p\bigl(x_i \mid x_{\setminus i}\bigr),$$

where $x_{\setminus i}$ is $x$ with the $i$-th position replaced by a [MASK] token. From $\mathrm{PLL}(x)$, pseudo-perplexity is defined as

$$\mathrm{PPPL}(x) = \exp\!\left(-\tfrac{1}{L}\,\mathrm{PLL}(x)\right).$$

Lower PPPL values indicate sequences the model considers more “natural” and, by consequence, of higher predicted fitness. The standard calculation of $\mathrm{PLL}(x)$ requires $L$ independent forward passes, one per masked position, rendering true pseudo-perplexity computationally expensive for long sequences (Kantroo et al., 2024).
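As a concrete illustration of the $L$-pass computation, here is a minimal numpy sketch. The `toy_mlm_logits` stand-in, the `MASK` index, and the 21-symbol vocabulary are illustrative assumptions, not the paper's implementation; a real scorer would call a model such as ESM2 in place of the toy function.

```python
import numpy as np

MASK = 20  # hypothetical [MASK] token index in a 21-symbol toy vocabulary

def toy_mlm_logits(seq):
    """Stand-in for a masked-LM forward pass returning (L, 21) logits.
    A seeded pseudo-random model keeps the sketch self-contained."""
    rng = np.random.default_rng(sum((i + 1) * t for i, t in enumerate(seq)) % 2**32)
    return rng.normal(size=(len(seq), 21))

def true_pppl(seq):
    """True pseudo-perplexity: one masked forward pass per position i."""
    pll = 0.0
    for i, aa in enumerate(seq):
        masked = list(seq)
        masked[i] = MASK                      # replace position i with [MASK]
        logits = toy_mlm_logits(masked)[i]    # model output at the masked site
        m = logits.max()
        logp = logits - m - np.log(np.exp(logits - m).sum())  # log softmax
        pll += logp[aa]                       # log p(x_i | x_{\i})
    return float(np.exp(-pll / len(seq)))     # PPPL = exp(-PLL / L)

seq = [3, 7, 1, 14, 9, 0, 18, 5]              # toy integer-encoded sequence
print(true_pppl(seq))
```

Because each position requires its own masked forward pass, the loop body is the cost bottleneck that OFS removes.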
2. One Fell Swoop Methodology
OFS pseudo-perplexity replaces the $L$-pass computation with a single forward pass and a lightweight, position-wise projection:
- Embedding Extraction: The unmasked input sequence is encoded with a pretrained protein language model (e.g., ESM2) to yield an embedding matrix $H \in \mathbb{R}^{L \times d}$, with each row $h_i$ representing the embedding of residue $x_i$.
- Projection to Masked Profiles: An ensemble of multilayer perceptrons (MLPs), each equipped with ReLU activations and layer normalization, is trained to map $h_i$ to a logit vector $z_i \in \mathbb{R}^{|\mathcal{V}|}$. The corresponding predicted masked distribution is $\hat{p}_i = \mathrm{softmax}(z_i)$.
- Training Objective: The projection network is trained by minimizing the cross-entropy between $\hat{p}_i$ and the ground-truth masked distribution $p(\cdot \mid x_{\setminus i})$, derived via one-at-a-time single masking over a large protein sequence corpus (Kantroo et al., 2024).
The OFS pseudo-log-likelihood is then

$$\widehat{\mathrm{PLL}}(x) = \sum_{i=1}^{L} \log \hat{p}_i(x_i),$$

and the OFS pseudo-perplexity

$$\widehat{\mathrm{PPPL}}(x) = \exp\!\left(-\tfrac{1}{L}\,\widehat{\mathrm{PLL}}(x)\right).$$
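The two formulas above can be sketched as a single-pass scorer. The random embeddings and the linear `project` callable are stand-ins for ESM2 and the trained MLP ensemble, chosen only to keep the sketch self-contained.

```python
import numpy as np

def ofs_pppl(H, seq, project):
    """OFS pseudo-perplexity from one unmasked encoder pass.
    H is the (L, d) embedding matrix; `project` maps one embedding h_i
    to a |V|-dim logit vector (the trained MLP ensemble in the paper;
    any callable works for this sketch)."""
    pll = 0.0
    for i, aa in enumerate(seq):
        z = project(H[i])                           # logits z_i
        m = z.max()
        logp = z - m - np.log(np.exp(z - m).sum())  # log softmax
        pll += logp[aa]                             # log p_hat_i(x_i)
    return float(np.exp(-pll / len(seq)))           # exp(-PLL_hat / L)

# toy stand-ins: random "embeddings" and a random linear projection
rng = np.random.default_rng(0)
L, d, V = 8, 16, 21
H = rng.normal(size=(L, d))
W = rng.normal(size=(d, V)) * 0.1
seq = rng.integers(0, 20, size=L)
print(ofs_pppl(H, seq, lambda h: h @ W))
```

Note that the loop here only applies the cheap projection per position; the expensive encoder runs once to produce `H`.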
3. Architecture and Computational Considerations
- Encoder Model: ESM2, a 650M parameter transformer, remains unchanged.
- Projection Ensemble: The projection comprises eight identical MLPs, each with two hidden layers, ReLU activations, and layer normalization, mapping the $d$-dimensional embedding to $|\mathcal{V}|$ logits. The final prediction is the average of the softmax outputs over ensemble members.
- Computational Complexity: True pseudo-perplexity requires $L$ encoder passes, for a total cost of $L \cdot C_{\text{enc}}$, where $C_{\text{enc}}$ is the cost of one encoder forward pass. OFS requires one encoder pass ($C_{\text{enc}}$) plus one projection pass ($C_{\text{proj}}$, a small constant relative to the encoder), yielding total cost $C_{\text{enc}} + C_{\text{proj}}$ compared to $L \cdot C_{\text{enc}}$. For typical sequence lengths, OFS is approximately 300-fold faster at inference (Kantroo et al., 2024).
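A minimal numpy sketch of the ensemble shape described above. The embedding width $d = 1280$ matches ESM2-650M; the 512-unit hidden width and the random weights are illustrative assumptions, not the paper's values.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def mlp(h, params):
    """One ensemble member: two hidden layers with layer norm and ReLU."""
    W1, W2, W3 = params
    a = np.maximum(layer_norm(h @ W1), 0.0)
    b = np.maximum(layer_norm(a @ W2), 0.0)
    return b @ W3  # |V| logits

def ensemble_profile(h, ensemble):
    """Predicted masked profile: average softmax over ensemble members."""
    probs = []
    for params in ensemble:
        z = mlp(h, params)
        ez = np.exp(z - z.max())
        probs.append(ez / ez.sum())
    return np.mean(probs, axis=0)

rng = np.random.default_rng(1)
d, hidden, V = 1280, 512, 21  # hidden width is an assumption for illustration
ensemble = [
    (rng.normal(size=(d, hidden)) * 0.02,
     rng.normal(size=(hidden, hidden)) * 0.02,
     rng.normal(size=(hidden, V)) * 0.02)
    for _ in range(8)
]
profile = ensemble_profile(rng.normal(size=d), ensemble)
print(profile.shape, profile.sum())
```

Averaging in probability space (after the softmax) rather than logit space keeps each member's output a valid distribution before combining.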
4. Empirical Performance and Benchmarking
Substitutions Benchmark
- Performance: On 61 substitution deep mutational scanning assays, OFS PPPL closely tracks true PPPL in aggregate Spearman correlation, with both near the masked-marginal (MM) heuristic for ESM2.
- Comparisons: OFS is slightly weaker than MM on pure substitutions but remains competitive with peer sequence-based fitness models [(Kantroo et al., 2024), Table 1].
Indels Benchmark
- OFS enables ESM2 to score insertions/deletions directly. On four indel assays, OFS PPPL achieves the highest aggregate mean Spearman correlation among sequence-only models on ProteinGym, defining a new state of the art (Table 2, Figure 2C).
Approximation Fidelity
- Profile Entry Accuracy: Mean absolute error for individual profile entries is less than 0.02.
- Correlation: The correlation between $\widehat{\mathrm{PLL}}$ and true $\mathrm{PLL}$ exceeds 0.98 on held-out clusters.
- Fitness Prediction Drop: Using OFS in place of true PPPL costs less than a 5% drop in fitness-prediction performance, confirming high-fidelity regression from embeddings to masked distributions.
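The fidelity and benchmark numbers above rest on rank correlation. A minimal Spearman implementation (a stand-in for `scipy.stats.spearmanr`, without tie handling) shows the computation: rank both vectors, then take the Pearson correlation of the ranks.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation on ranks.
    Minimal sketch with no tie handling."""
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # ranks of b
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# toy check: a noisy monotone relationship scores near 1
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + 0.01 * rng.normal(size=100)
print(round(spearman(x, y), 3))
```

Rank correlation is the natural metric here because fitness assays and model scores live on different scales; only the ordering of variants matters.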
5. Applications
Monte Carlo Exploration of Sequence Space
OFS PPPL serves as an effective energy function in Metropolis–Hastings MCMC sampling. Mutation candidates are proposed from the OFS profile, and a proposed move from $x$ to $x'$ is accepted with probability

$$\alpha(x \to x') = \min\!\left(1,\; \frac{q(x \mid x')}{q(x' \mid x)}\, e^{-\beta\,[E(x') - E(x)]}\right),$$

where $E(x) = \log \widehat{\mathrm{PPPL}}(x)$ is the OFS energy, $\beta$ an inverse temperature, and $q$ the profile-based proposal distribution.
This protocol enables rapid generation of diverse, high-confidence protein variants (validity verified using pLDDT scores) in minutes (Figure 2 in (Kantroo et al., 2024)).
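The sampling loop can be sketched as follows. The `energy` and `profile` callables are toy stand-ins for the ESM2-based quantities, and with the uniform toy proposal the Hastings correction $q(x \mid x')/q(x' \mid x)$ equals 1 and is omitted.

```python
import numpy as np

def mh_sample(seq, energy, profile, n_steps=500, beta=3.0, seed=0):
    """Metropolis–Hastings over sequences with an OFS-style energy.
    `energy(seq)` plays the role of log OFS-PPPL; `profile(seq, i)`
    returns a |V|-dim proposal distribution for site i. Both are toy
    stand-ins here, not the real ESM2-based quantities."""
    rng = np.random.default_rng(seed)
    x, e = list(seq), energy(seq)
    for _ in range(n_steps):
        i = rng.integers(len(x))
        p = profile(x, i)
        aa = int(rng.choice(len(p), p=p))      # propose a mutation at site i
        x_new = list(x)
        x_new[i] = aa
        e_new = energy(x_new)
        if rng.random() < np.exp(-beta * (e_new - e)):  # MH acceptance
            x, e = x_new, e_new
    return x, e

# toy setup: 21-letter alphabet; energy counts residues differing from 0,
# so the chain should drift toward low-energy (mostly-0) sequences
V = 21
energy = lambda s: sum(a != 0 for a in s)
profile = lambda s, i: np.full(V, 1.0 / V)
final, e_final = mh_sample([5] * 10, energy, profile)
print(e_final)
```

A profile-based (non-uniform) proposal, as in the paper, would require restoring the $q$ ratio in the acceptance test to keep the chain's stationary distribution correct.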
Fitness Estimation and Ancestral Stability
OFS PPPL enables direct ranking of extant and reconstructed ancestral proteins. Among 257 Pfam families, 79.8% exhibited lower PPPL (higher predicted stability) for ancestral reconstructions versus extant sequences (Cliff's $\delta$ effect size; Figure 3), recapitulating the "old-is-gold" effect in ancestral sequence stability.
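Cliff's $\delta$ is a simple pairwise effect size: the probability that a value from one group exceeds a value from the other, minus the reverse. A minimal sketch on hypothetical PPPL values (the numbers below are illustrative, not from the paper):

```python
import numpy as np

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    a = np.asarray(a)[:, None]  # column vector
    b = np.asarray(b)[None, :]  # row vector
    return float((a > b).mean() - (a < b).mean())

ext = [1.5, 1.4, 1.6, 1.3]  # hypothetical extant PPPL values
anc = [1.2, 1.1, 1.3, 1.0]  # hypothetical ancestral PPPL values (lower = better)
print(cliffs_delta(ext, anc))  # → 0.9375
```

A $\delta$ near 1 here means extant PPPL almost always exceeds ancestral PPPL, matching the "old-is-gold" direction.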
6. Limitations and Considerations
- In-context Repetition: Over long MCMC chains, OFS-guided sampling sometimes drifts toward low-diversity, repetitive sequence patterns, an artifact of the encoder's preference for in-context repeats.
- Encoder Training Bias: Phylogenetic biases in the encoder’s training data can skew sequence space exploration toward over-represented clades.
- Systematic Errors: The projection network inherits any systematic errors of the base encoder.
- Approximate Nature: OFS remains a predictive approximation; while Spearman correlation is high, minor performance drops relative to true PPPL persist in certain benchmarks.
7. Summary and Significance
One Fell Swoop pseudo-perplexity transforms the computationally intensive MLM pseudo-perplexity calculation into an efficient single-step procedure by regressing from unmasked representations to masked residue distributions. It preserves nearly all predictive power of the original metric, enables scoring for sequence variants with indels, and underlies efficient sequence design routines such as MCMC exploration and ancestral sequence analysis. These properties establish OFS as a practical tool for protein fitness estimation and generative modeling in computational biology (Kantroo et al., 2024).