
OFS Pseudo-perplexity for Protein Fitness

Updated 15 December 2025
  • The paper introduces OFS pseudo-perplexity, an efficient one-pass approximation for protein sequence fitness evaluation.
  • It leverages regression from unmasked residue embeddings to masked token distributions, reducing computational cost compared to traditional methods.
  • Empirical results demonstrate high correlation with true pseudo-perplexity and success in applications like sequence design and ancestral protein stability analysis.

One Fell Swoop (OFS) pseudo-perplexity is an efficient approximation of masked language model (MLM) pseudo-perplexity for protein sequence fitness estimation. It leverages a regression from unmasked residue embeddings to masked token distributions, enabling quantification of model uncertainty in a single forward pass. OFS pseudo-perplexity retains the predictive power of true pseudo-perplexity at a fraction of the computational cost and facilitates high-throughput scoring, rapid sequence sampling, and in-depth analysis of sequence design and stability (Kantroo et al., 2024).

1. Pseudo-perplexity in Masked Language Models

Pseudo-perplexity for protein language models trained with the MLM objective quantifies the model's uncertainty about a sequence. Given a protein sequence $x = (x_1, \dots, x_n)$ of length $n$ over a vocabulary $V$ (the 20 amino acids plus special tokens), the model is trained to predict masked residues from surrounding context. The pseudo-log-likelihood aggregates the masked prediction at each position $i$:

$$L(x) = \sum_{i=1}^{n} \log P(x_i \mid x_{-i})$$

where $x_{-i}$ denotes $x$ with the $i$-th position replaced by a [MASK] token. From $L(x)$, pseudo-perplexity is defined as

$$\text{PPPL}(x) = \exp\left(-\frac{1}{n} L(x)\right)$$

Lower PPPL values indicate sequences the model considers more “natural” and hence of higher predicted fitness. The standard calculation of $L(x)$ requires $n$ independent forward passes, one per masked position, rendering true pseudo-perplexity computationally expensive for long sequences (Kantroo et al., 2024).
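For concreteness, here is a minimal sketch of the exact computation in Python, assuming an ESM-style masked language model whose forward call returns a dict with per-position logits (the interface and names are illustrative, not the paper's code):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def true_pppl(model, tokens: torch.Tensor, mask_idx: int) -> float:
    """Exact pseudo-perplexity: one masked forward pass per position.
    `tokens` has shape (1, n); real tokenizers also add BOS/EOS tokens,
    which should be excluded from the positions scored here."""
    n = tokens.shape[1]
    total_logp = 0.0
    for i in range(n):
        masked = tokens.clone()
        masked[0, i] = mask_idx                      # mask position i only
        logits = model(masked)["logits"]             # (1, n, |V|) per-position logits
        logp = F.log_softmax(logits[0, i], dim=-1)   # log P(. | x_{-i})
        total_logp += logp[tokens[0, i]].item()      # log P(x_i | x_{-i})
    return math.exp(-total_logp / n)                 # PPPL(x)
```

The loop is exactly the $O(n)$ bottleneck OFS removes: each iteration re-encodes the full sequence just to read off a single position.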

2. One Fell Swoop Methodology

OFS pseudo-perplexity replaces this $O(n)$ procedure with a single forward pass followed by a lightweight, position-wise projection:

  1. Embedding Extraction: The unmasked input sequence $x$ is encoded with a pretrained protein language model (e.g., ESM2) to yield $E = \mathrm{Encoder}(x) \in \mathbb{R}^{n \times d}$, with each row $E_i$ representing the embedding of $x_i$.
  2. Projection to Masked Profiles: An ensemble of multilayer perceptrons (MLPs), each equipped with ReLU activations and layer normalization, is trained to map $E_i$ to a logit vector $\hat{z}_i \in \mathbb{R}^{|V|}$. The corresponding masked distribution is $\hat{P}(\cdot \mid x_{-i}) = \mathrm{softmax}(\hat{z}_i)$.
  3. Training Objective: The projection network $f$ is trained by minimizing the cross-entropy between $\hat{P}(\cdot \mid x_{-i})$ and the ground-truth masked distribution $P(\cdot \mid x_{-i})$, obtained by one-at-a-time single masking, over a large protein sequence corpus (Kantroo et al., 2024); a sketch of this head and its loss follows below.
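A minimal PyTorch sketch of one ensemble member and the distillation loss, following the architectural description above (the hidden width and the small epsilon are assumptions; the paper specifies $d = 1280$, 20 output logits, two hidden layers, ReLU, layer normalization, and an 8-member ensemble):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OFSHead(nn.Module):
    """One ensemble member: unmasked embedding -> masked-profile logits.
    Hidden width 512 is an illustrative choice, not from the paper."""
    def __init__(self, d: int = 1280, hidden: int = 512, vocab: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, vocab),
        )

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        return self.net(E)          # (n, vocab) logits z_hat_i per residue

def ofs_profiles(heads, E: torch.Tensor) -> torch.Tensor:
    """Average the softmax outputs of the ensemble (8 members in the paper)."""
    probs = torch.stack([F.softmax(h(E), dim=-1) for h in heads])
    return probs.mean(dim=0)        # (n, vocab) approximate masked profiles

def distillation_loss(p_hat: torch.Tensor, p_true: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against soft targets: p_true are the ground-truth masked
    profiles from one-at-a-time masking of the frozen encoder."""
    return -(p_true * torch.log(p_hat + 1e-9)).sum(dim=-1).mean()
```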

The OFS pseudo-log-likelihood is then

$$\hat{L}_{\mathrm{OFS}}(x) = \sum_{i=1}^{n} \log \hat{P}(x_i \mid x_{-i})$$

and the OFS pseudo-perplexity

$$\text{PPPL}_{\mathrm{OFS}}(x) = \exp\left(-\frac{1}{n} \hat{L}_{\mathrm{OFS}}(x)\right)$$
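Scoring therefore reduces to one encoder pass plus one cheap projection pass. Continuing the illustrative interfaces from the sketches above (`encoder` and the 20-way index alignment are assumptions, not the paper's API):

```python
import torch

@torch.no_grad()
def ofs_pppl(encoder, heads, tokens: torch.Tensor, seq_ids: torch.Tensor) -> float:
    """One-pass OFS pseudo-perplexity. `encoder` maps the unmasked sequence to
    per-residue embeddings (n, d); `seq_ids` holds 0..19 amino-acid indices
    aligned with the heads' 20-way output."""
    E = encoder(tokens)                            # single encoder forward pass
    p = ofs_profiles(heads, E)                     # (n, 20) regressed masked profiles
    logp = torch.log(p[torch.arange(len(seq_ids)), seq_ids] + 1e-9)
    return float(torch.exp(-logp.mean()))          # PPPL_OFS(x)
```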

3. Architecture and Computational Considerations

  • Encoder Model: ESM2, a 650M-parameter transformer, is used frozen and unchanged.
  • Projection Ensemble: The projection comprises eight identical MLPs, each with two hidden layers, ReLU activations, and layer normalization, mapping $d = 1280$ embedding dimensions to $|V| = 20$ amino-acid logits. The final prediction is the average softmax over ensemble members.
  • Computational Complexity: True pseudo-perplexity requires $n$ encoder passes, for a total cost of $\sim n \cdot T_{\mathrm{enc}}$. OFS requires one encoder pass ($T_{\mathrm{enc}}$) plus one projection pass ($T_{\mathrm{proj}}$, linear in $n$ but far cheaper than the encoder), for a total cost of $\sim T_{\mathrm{enc}} + T_{\mathrm{proj}}$. For typical $n \sim 300$, OFS is approximately 300-fold faster at inference (Kantroo et al., 2024).
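The 300-fold figure follows directly from the cost ratio once the projection pass is negligible next to the encoder pass:

$$\frac{n \cdot T_{\mathrm{enc}}}{T_{\mathrm{enc}} + T_{\mathrm{proj}}} \approx n \quad \text{for } T_{\mathrm{proj}} \ll T_{\mathrm{enc}}$$

so scoring a 300-residue sequence saves roughly 299 encoder passes.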

4. Empirical Performance and Benchmarking

Substitutions Benchmark

  • Performance: On 61 substitution deep mutational scanning assays, true PPPL achieves an aggregate Spearman $\rho = 0.57$, OFS PPPL achieves $\rho = 0.55$, and the masked-marginal (MM) heuristic for ESM2 attains $\rho = 0.43$ (aggregate mean).
  • Comparisons: OFS is slightly weaker than true PPPL on pure substitutions but remains competitive with peer sequence fitness models (Kantroo et al., 2024, Table 1).

Indels Benchmark

  • OFS enables ESM2 to score insertions and deletions directly. On four indel assays, OFS PPPL achieves an aggregate mean Spearman $\rho = 0.574$, defining a new state of the art on ProteinGym for sequence-only models (Kantroo et al., 2024, Table 2, Figure 2C).

Approximation Fidelity

  • Profile Entry Accuracy: Mean absolute error for individual profile entries is less than 0.02.
  • Correlation: The correlation between $\hat{L}_{\mathrm{OFS}}(x)$ and true $L(x)$ exceeds 0.98 on held-out clusters.
  • Fitness Prediction Drop: Fitness prediction performance drops by less than 5% relative to true PPPL, confirming high-fidelity regression from embeddings to masked distributions.
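These fidelity checks are straightforward to reproduce once both score sets exist; a hedged sketch using SciPy (the file paths and array shapes are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder inputs: per-sequence pseudo-log-likelihoods from the exact
# n-pass computation and from one-pass OFS, plus the matched profile entries.
L_true = np.load("L_true.npy")          # shape (num_seqs,)
L_ofs = np.load("L_ofs.npy")            # shape (num_seqs,)
p_true = np.load("profiles_true.npy")   # shape (num_positions, 20)
p_ofs = np.load("profiles_ofs.npy")     # shape (num_positions, 20)

rho, _ = spearmanr(L_true, L_ofs)       # paper reports correlation > 0.98
mae = np.abs(p_true - p_ofs).mean()     # paper reports MAE < 0.02 per entry
print(f"score correlation: {rho:.3f}, profile MAE: {mae:.4f}")
```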

5. Applications

Monte Carlo Exploration of Sequence Space

OFS PPPL serves as an effective energy function $E(x) = \log \text{PPPL}_{\mathrm{OFS}}(x)$, i.e., the negative per-residue pseudo-log-likelihood, in Metropolis–Hastings MCMC sampling. Mutation candidates $x \to x'$ are proposed from the OFS profile, and the acceptance probability is

$$A = \min\left\{1,\ \exp\left(-\frac{E(x') - E(x)}{T}\right) \frac{q(x \mid x')}{q(x' \mid x)}\right\}$$

This protocol enables rapid generation of diverse, high-confidence protein variants (validity verified using pLDDT scores) in minutes (Kantroo et al., 2024, Figure 2).
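A minimal sketch of one such Metropolis–Hastings step, assuming `energy_fn` returns $\log \text{PPPL}_{\mathrm{OFS}}$ and `profile_fn` returns the $(n, 20)$ OFS profiles used as the proposal $q$ (all names are illustrative):

```python
import math
import random
import torch

@torch.no_grad()
def mh_step(seq_ids, energy_fn, profile_fn, T: float = 1.0):
    """One Metropolis-Hastings step over sequence space.
    seq_ids: list of amino-acid indices; the uniform choice of position
    cancels in the proposal ratio, so only the profile terms appear."""
    n = len(seq_ids)
    i = random.randrange(n)                       # position to mutate
    q_fwd = profile_fn(seq_ids)                   # (n, 20) proposal at x
    new_aa = torch.multinomial(q_fwd[i], 1).item()
    proposal = list(seq_ids)
    proposal[i] = new_aa
    q_rev = profile_fn(proposal)                  # (n, 20) proposal at x'
    dE = energy_fn(proposal) - energy_fn(seq_ids)
    log_A = (-dE / T
             + math.log(float(q_rev[i][seq_ids[i]]) + 1e-9)   # q(x | x')
             - math.log(float(q_fwd[i][new_aa]) + 1e-9))      # q(x' | x)
    return proposal if math.log(random.random()) < log_A else seq_ids
```

Caching profiles and energies between steps avoids recomputing the encoder pass for rejected moves.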

Fitness Estimation and Ancestral Stability

OFS PPPL enables direct ranking of extant and reconstructed ancestral proteins. Among 257 Pfam families, 79.8% exhibited lower PPPL (higher predicted stability) for ancestral reconstructions versus extant sequences (Cliff’s $\delta > +0.33$), recapitulating the "old-is-gold" effect in ancestral sequence stability (Kantroo et al., 2024, Figure 3).
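Cliff's $\delta$ here is the standard nonparametric effect size; a small sketch of how it could be computed per family (the pairing of extant vs. ancestral PPPL arrays is an assumption about the analysis, not code from the paper):

```python
import numpy as np

def cliffs_delta(a, b) -> float:
    """Cliff's delta: P(a > b) - P(a < b) over all cross pairs.
    With `a` = extant-sequence PPPLs and `b` = ancestral PPPLs for one
    family, delta > +0.33 indicates ancestors score markedly lower."""
    a = np.asarray(a)[:, None]
    b = np.asarray(b)[None, :]
    return float((a > b).mean() - (a < b).mean())
```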

6. Limitations and Considerations

  • In-context Repetition: Over long MCMC chains, OFS-guided sampling sometimes drifts toward low-diversity, repetitive sequence patterns.
  • Encoder Training Bias: Phylogenetic biases in the encoder’s training data can skew sequence space exploration toward over-represented clades.
  • Systematic Errors: The projection network inherits any systematic errors of the base encoder.
  • Approximate Nature: OFS remains a predictive approximation; while Spearman correlation is high, minor performance drops relative to true PPPL persist in certain benchmarks.

7. Summary and Significance

One Fell Swoop pseudo-perplexity transforms the computationally intensive MLM pseudo-perplexity calculation into an efficient single-step procedure by regressing from unmasked representations to masked residue distributions. It preserves nearly all predictive power of the original metric, enables scoring for sequence variants with indels, and underlies efficient sequence design routines such as MCMC exploration and ancestral sequence analysis. These properties establish OFS as a practical tool for protein fitness estimation and generative modeling in computational biology (Kantroo et al., 2024).
