- The paper introduces FSFP, a framework that leverages few-shot learning to enhance protein language models using minimal wet-lab data.
- It demonstrates that integrating meta-transfer learning and listwise ranking can improve Spearman correlations from below 0.1 to over 0.5 with as few as 20 examples.
- The study combines MAML-based meta-training and auxiliary tasks to offer a robust, data-efficient method for advancing protein engineering and directed evolution.
Enhancing the Efficiency of Protein Language Models with Minimal Wet-Lab Data through Few-Shot Learning
The paper "Enhancing the Efficiency of Protein Language Models with Minimal Wet-Lab Data through Few-Shot Learning" presents FSFP (Few-Shot Learning for Protein Fitness Prediction), a training strategy that improves protein language models (PLMs) when labeled experimental data are scarce. FSFP combines meta-transfer learning (MTL), learning to rank (LTR), and parameter-efficient fine-tuning into a single framework that sharpens a PLM's fitness predictions using only a few tens of experimentally measured single-site mutants of the target protein.
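The parameter-efficient fine-tuning component can be pictured as a low-rank (LoRA-style) adapter on a single frozen weight matrix. The sketch below is illustrative only: the layer sizes, rank, and scaling factor are assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 512, 512, 8  # hypothetical layer sizes; r is the adapter rank

# Frozen pretrained weight matrix (stands in for one PLM projection layer).
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank adapters; only these would be updated during fine-tuning.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # zero-init so the adapted layer starts identical to W

def adapted_forward(x, alpha=16.0):
    """Forward pass through the frozen layer plus its low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the adapter is a no-op, so outputs match the frozen layer exactly.
assert np.allclose(adapted_forward(x), W @ x)

frozen = W.size
trainable = A.size + B.size
print(f"trainable params: {trainable} / {frozen} "
      f"({100 * trainable / frozen:.2f}% of the layer)")
```

The point of the design is visible in the last lines: only a few percent of the layer's parameters are trained, which is what makes fine-tuning feasible with a few tens of labels.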
FSFP Methodology
The FSFP framework addresses significant challenges in protein fitness prediction: the scarcity of labeled data, the practical limits of high-throughput assays, and the need to exploit large unlabeled sequence datasets. Its core component, meta-transfer learning, uses existing labeled datasets to provide a strong initial parameterization for PLMs, so that they adapt quickly to a new target protein. The approach involves:
- Building Auxiliary Tasks: This step collects labeled datasets of similar proteins and generates pseudo-labels via multiple sequence alignment (MSA) to create tasks for meta-training.
- Meta-Training: Employs the MAML (Model-Agnostic Meta-Learning) algorithm to train PLMs on these auxiliary tasks, resulting in a model that can rapidly adjust to new tasks with minimal data.
- Transfer Learning and LTR: FSFP fine-tunes the meta-trained model with a listwise learning-to-rank (LTR) objective, predicting relative rather than absolute fitness, which matches directed evolution's practical goal of prioritizing the most promising variants.
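The ranking objective in the steps above can be sketched with ListMLE, a standard listwise loss. Everything here (the linear scorer, toy features, optimizer, and hyperparameters) is an illustrative assumption, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def list_mle(scores, labels):
    """Negative log-likelihood of the permutation that sorts the true
    labels in descending order, under a Plackett-Luce model."""
    s = scores[np.argsort(-labels)]  # scores in ground-truth rank order
    loss = 0.0
    for i in range(len(s)):
        tail = s[i:]
        m = tail.max()
        loss += m + np.log(np.exp(tail - m).sum()) - s[i]  # logsumexp - s_i
    return loss

# Toy few-shot setting: 20 "mutants" with 5 hypothetical features each and a
# linear ground-truth fitness; we fit a linear scorer from rankings alone.
X = rng.standard_normal((20, 5))
y = X @ rng.standard_normal(5)

w = np.zeros(5)
loss_at = lambda v: list_mle(X @ v, y)
start = loss_at(w)

lr, eps = 0.05, 1e-5
for _ in range(500):
    # finite-difference gradient, adequate for a 5-parameter sketch
    grad = np.array([(loss_at(w + eps * e) - loss_at(w - eps * e)) / (2 * eps)
                     for e in np.eye(5)])
    w -= lr * grad

scores = X @ w
pairs = [(i, j) for i in range(20) for j in range(i + 1, 20)]
concordant = np.mean([(y[i] > y[j]) == (scores[i] > scores[j])
                      for i, j in pairs])
print(f"loss {start:.1f} -> {loss_at(w):.1f}, concordant pairs: {concordant:.0%}")
```

Because only relative order enters the loss, the scorer is optimized for exactly what directed evolution needs: ranking candidate mutants, not reproducing assay values.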
Empirical Results
FSFP was assessed on 87 deep mutational scanning datasets from ProteinGym, a comprehensive benchmark. It showed robust predictive performance, particularly on datasets where baselines such as ridge regression fell short. A key finding is that FSFP lifted models with poor zero-shot performance (Spearman correlations below 0.1) to correlations above 0.5 using as few as 20 labeled examples.
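Since the results are reported as Spearman correlations, it may help to see the metric itself. Below is a self-contained implementation using average ranks for ties; the example prediction and target values are made up for illustration.

```python
import numpy as np

def rankdata_avg(a):
    """Ranks starting at 1, with tied values receiving the average of
    the positions they occupy."""
    order = np.argsort(a, kind="stable")
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    for v in np.unique(a):          # average ranks within each tied group
        mask = a == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(pred, target):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rp, rt = rankdata_avg(pred), rankdata_avg(target)
    rp, rt = rp - rp.mean(), rt - rt.mean()
    return (rp @ rt) / np.sqrt((rp @ rp) * (rt @ rt))

pred   = np.array([0.1, 0.4, 0.2, 0.9, 0.9])   # hypothetical model scores
target = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # hypothetical measured fitness
print(spearman(pred, target))
```

The metric is invariant to any monotone rescaling of the predictions, which is why it pairs naturally with the ranking objective FSFP trains on.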
Comparative Analysis
FSFP was compared against zero-shot prediction, supervised ridge regression, and other parameter-efficient fine-tuning techniques. It consistently outperformed these baselines, owing to its efficient use of the few available labels and its integration of MSA-derived evolutionary signals with supervised training. Gains were significant across SaProt, ESM-1v, and ESM-2, with SaProt (FSFP) the strongest performer thanks to its joint encoding of structural and sequence information.
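The ridge baseline mentioned above regresses measured fitness directly on fixed embeddings. A minimal closed-form sketch follows; the random "embeddings", dimensions, noise level, and regularization strength are stand-ins for real PLM features, not the benchmark's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

n_train, n_test, d = 20, 100, 8  # few-shot split; d stands in for embedding size

# Random vectors standing in for frozen PLM embeddings of mutant sequences.
E = rng.standard_normal((n_train + n_test, d))
w_true = rng.standard_normal(d)
fitness = E @ w_true + 0.1 * rng.standard_normal(n_train + n_test)  # noisy assay

X_tr, y_tr = E[:n_train], fitness[:n_train]
X_te, y_te = E[n_train:], fitness[n_train:]

lam = 1.0  # regularization strength
# Closed-form ridge solution: w = (X'X + lam * I)^{-1} X'y
w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

r = np.corrcoef(X_te @ w, y_te)[0, 1]
print(f"held-out Pearson r = {r:.2f}")
```

With informative embeddings such a baseline can be strong, but it treats the PLM as a frozen feature extractor and cannot exploit unlabeled evolutionary data, which is where FSFP's meta-training provides its edge.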
Implications and Future Directions
The findings suggest broad applicability of FSFP in AI-guided protein engineering, especially in settings constrained by limited experimental data. The framework's generalizability and data efficiency give it clear potential to impact directed evolution and related applications. Looking forward, further work might apply FSFP to other foundation models, integrate richer auxiliary datasets, or improve the optimization strategy to push predictive performance further.
This research demonstrates a clear advance in using few-shot learning to refine PLMs, a step toward more efficient protein design that extracts maximal value from minimal experimental data. It paves the way for innovation in protein engineering, where data scarcity and computational cost remain formidable obstacles.