Validity and extrapolation of fitness oracles beyond the training distribution

Ascertain whether the supervised fitness oracle, an ensemble of neural network regressors trained on lower-order mutational data, accurately captures the true protein fitness landscape. In particular, determine whether it extrapolates reliably to sequences carrying many mutations relative to the parent sequence, as is required when it scores variants far from the original training distribution in the TrpB and CreiLOV design spaces.

Background

To evaluate generated sequences at scale, the paper trains oracles (ensembles of MLPs) on available fitness data and uses their predictions as proxies for true fitness when exploring large design spaces. These oracles are trained primarily on mutants close to the parent sequence.

Accurate extrapolation of such oracles is critical for fair comparisons across methods and for assessing optimization outcomes. The paper explicitly leaves open how reliable they are far from the training distribution, e.g., for variants with many mutations, and this uncertainty affects conclusions about absolute performance and about comparisons to directed evolution.
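One simple way to probe this concern is to group variants by their number of mutations from the parent and check how much the ensemble members disagree in each group. The sketch below is illustrative only and is not taken from the paper: the function names, the toy sequences, and the prediction values are all hypothetical, and ensemble spread is used here as a common heuristic for uncertainty. Low spread does not establish that the oracle is accurate; it can only flag regions where the ensemble members diverge.

```python
from collections import defaultdict
from statistics import mean, pstdev


def mutation_count(parent: str, variant: str) -> int:
    """Hamming distance between two equal-length sequences."""
    if len(parent) != len(variant):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(parent, variant))


def spread_by_mutation_count(parent, variants, ensemble_preds):
    """Average ensemble disagreement, grouped by mutation count.

    ensemble_preds[i] holds one prediction per ensemble member for
    variants[i]. Returns {mutation count: mean per-variant std}.
    """
    buckets = defaultdict(list)
    for variant, preds in zip(variants, ensemble_preds):
        buckets[mutation_count(parent, variant)].append(pstdev(preds))
    return {k: mean(v) for k, v in sorted(buckets.items())}


# Toy data (hypothetical): a 4-residue parent and three variants,
# each scored by a 3-member ensemble.
parent = "ACDE"
variants = ["ACDF", "AGDF", "WGHF"]
ensemble_preds = [
    [0.50, 0.52, 0.51],  # 1 mutation: members agree closely
    [0.40, 0.55, 0.47],  # 2 mutations: moderate disagreement
    [0.10, 0.90, 0.45],  # 4 mutations: large disagreement
]

result = spread_by_mutation_count(parent, variants, ensemble_preds)
for n_mut, spread in result.items():
    print(f"{n_mut} mutation(s): mean ensemble std = {spread:.3f}")
```

In practice the predictions would come from the trained oracle members rather than hand-written lists. A spread that grows with mutation count would be consistent with unreliable extrapolation, though only held-out experimental measurements of higher-order mutants can confirm or refute it.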

References

It is unclear if the oracle captures the true nature of the protein fitness landscape or extrapolates well to sequences with many mutations relative to the original fitness dataset from which the oracle was trained.

— Steering Generative Models with Experimental Data for Protein Fitness Optimization (2505.15093 - Yang et al., 21 May 2025) in Section: Protein fitness optimization task – Comparison to existing protein engineering methods
