- The paper introduces FSFP, a framework that leverages few-shot learning to enhance protein language models using minimal wet-lab data.
- It demonstrates that integrating meta-transfer learning and listwise ranking can improve Spearman correlations from below 0.1 to over 0.5 with as few as 20 examples.
- The study combines MAML-based meta-training and auxiliary tasks to offer a robust, data-efficient method for advancing protein engineering and directed evolution.
Enhancing the Efficiency of Protein Language Models with Minimal Wet-Lab Data through Few-Shot Learning
The paper "Enhancing the Efficiency of Protein Language Models with Minimal Wet-Lab Data through Few-Shot Learning" presents FSFP (Few-Shot Learning for Protein Fitness Prediction), a training strategy that improves protein language models (PLMs) when labeled experimental data are scarce. FSFP combines meta-transfer learning (MTL), learning to rank (LTR), and parameter-efficient fine-tuning into a single framework that sharpens a PLM's fitness predictions using only a few tens of experimentally measured single-site mutants of the target protein.
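The parameter-efficient fine-tuning component can be pictured as a low-rank (LoRA-style) adapter on a single frozen weight matrix. The sketch below is illustrative only: the layer sizes, rank, and scaling factor are assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 512, 512, 8  # hypothetical layer sizes; r is the adapter rank

# Frozen pretrained weight matrix (stands in for one PLM projection layer).
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank adapters; only these would be updated during fine-tuning.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # zero-init so the adapted layer starts identical to W

def adapted_forward(x, alpha=16.0):
    """Forward pass through the frozen layer plus its low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the adapter is a no-op, so outputs match the frozen layer exactly.
assert np.allclose(adapted_forward(x), W @ x)

frozen = W.size
trainable = A.size + B.size
print(f"trainable params: {trainable} / {frozen} "
      f"({100 * trainable / frozen:.2f}% of the layer)")
```

The point of the design is visible in the last lines: only a few percent of the layer's parameters are trained, which is what makes fine-tuning feasible with a few tens of labels.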
FSFP Methodology
The FSFP framework addresses significant challenges in protein fitness prediction: the scarcity of labeled data, the practical limits of high-throughput assays, and the need to exploit large unlabeled sequence datasets. Its core component, meta-transfer learning, uses existing labeled datasets to provide a strong initial parameterization for PLMs, so that they adapt quickly to a new target protein. The approach involves:
- Building Auxiliary Tasks: This step collects labeled datasets of similar proteins and generates pseudo-labels via multiple sequence alignment (MSA) to create tasks for meta-training.
- Meta-Training: Employs the MAML (Model-Agnostic Meta-Learning) algorithm to train PLMs on these auxiliary tasks, resulting in a model that can rapidly adjust to new tasks with minimal data.
- Transfer Learning and LTR: FSFP fine-tunes the meta-trained model with a listwise learning-to-rank (LTR) objective, predicting relative rather than absolute fitness, which matches directed evolution's practical goal of prioritizing the most promising variants.
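The ranking objective in the steps above can be sketched with ListMLE, a standard listwise loss. Everything here (the linear scorer, toy features, optimizer, and hyperparameters) is an illustrative assumption, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def list_mle(scores, labels):
    """Negative log-likelihood of the permutation that sorts the true
    labels in descending order, under a Plackett-Luce model."""
    s = scores[np.argsort(-labels)]  # scores in ground-truth rank order
    loss = 0.0
    for i in range(len(s)):
        tail = s[i:]
        m = tail.max()
        loss += m + np.log(np.exp(tail - m).sum()) - s[i]  # logsumexp - s_i
    return loss

# Toy few-shot setting: 20 "mutants" with 5 hypothetical features each and a
# linear ground-truth fitness; we fit a linear scorer from rankings alone.
X = rng.standard_normal((20, 5))
y = X @ rng.standard_normal(5)

w = np.zeros(5)
loss_at = lambda v: list_mle(X @ v, y)
start = loss_at(w)

lr, eps = 0.05, 1e-5
for _ in range(500):
    # finite-difference gradient, adequate for a 5-parameter sketch
    grad = np.array([(loss_at(w + eps * e) - loss_at(w - eps * e)) / (2 * eps)
                     for e in np.eye(5)])
    w -= lr * grad

scores = X @ w
pairs = [(i, j) for i in range(20) for j in range(i + 1, 20)]
concordant = np.mean([(y[i] > y[j]) == (scores[i] > scores[j])
                      for i, j in pairs])
print(f"loss {start:.1f} -> {loss_at(w):.1f}, concordant pairs: {concordant:.0%}")
```

Because only relative order enters the loss, the scorer is optimized for exactly what directed evolution needs: ranking candidate mutants, not reproducing assay values.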
Empirical Results
FSFP was assessed on 87 deep mutational scanning datasets from ProteinGym, a comprehensive benchmark. It showed robust predictive performance, particularly on datasets where baselines such as ridge regression fell short. A key finding is that FSFP lifted models with poor zero-shot performance (Spearman correlations below 0.1) to correlations above 0.5 using as few as 20 labeled examples.
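Since the results are reported as Spearman correlations, it may help to see the metric itself. Below is a self-contained implementation using average ranks for ties; the example prediction and target values are made up for illustration.

```python
import numpy as np

def rankdata_avg(a):
    """Ranks starting at 1, with tied values receiving the average of
    the positions they occupy."""
    order = np.argsort(a, kind="stable")
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    for v in np.unique(a):          # average ranks within each tied group
        mask = a == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(pred, target):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rp, rt = rankdata_avg(pred), rankdata_avg(target)
    rp, rt = rp - rp.mean(), rt - rt.mean()
    return (rp @ rt) / np.sqrt((rp @ rp) * (rt @ rt))

pred   = np.array([0.1, 0.4, 0.2, 0.9, 0.9])   # hypothetical model scores
target = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # hypothetical measured fitness
print(spearman(pred, target))
```

The metric is invariant to any monotone rescaling of the predictions, which is why it pairs naturally with the ranking objective FSFP trains on.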
Comparative Analysis
FSFP was compared against zero-shot prediction, supervised ridge regression, and other parameter-efficient fine-tuning techniques. It consistently outperformed these baselines, owing to its efficient use of the few available labels and its integration of MSA-derived evolutionary signals with supervised training. Gains were significant across SaProt, ESM-1v, and ESM-2, with SaProt (FSFP) the strongest performer thanks to its joint encoding of structural and sequence information.
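The ridge baseline mentioned above regresses measured fitness directly on fixed embeddings. A minimal closed-form sketch follows; the random "embeddings", dimensions, noise level, and regularization strength are stand-ins for real PLM features, not the benchmark's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

n_train, n_test, d = 20, 100, 8  # few-shot split; d stands in for embedding size

# Random vectors standing in for frozen PLM embeddings of mutant sequences.
E = rng.standard_normal((n_train + n_test, d))
w_true = rng.standard_normal(d)
fitness = E @ w_true + 0.1 * rng.standard_normal(n_train + n_test)  # noisy assay

X_tr, y_tr = E[:n_train], fitness[:n_train]
X_te, y_te = E[n_train:], fitness[n_train:]

lam = 1.0  # regularization strength
# Closed-form ridge solution: w = (X'X + lam * I)^{-1} X'y
w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

r = np.corrcoef(X_te @ w, y_te)[0, 1]
print(f"held-out Pearson r = {r:.2f}")
```

With informative embeddings such a baseline can be strong, but it treats the PLM as a frozen feature extractor and cannot exploit unlabeled evolutionary data, which is where FSFP's meta-training provides its edge.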
Implications and Future Directions
The findings suggest broad applicability of FSFP in AI-guided protein engineering, especially in settings constrained by limited experimental data. The framework's generalizability and data efficiency give it clear potential to impact directed evolution and related applications. Looking forward, further work might apply FSFP to other foundation models, integrate richer auxiliary datasets, or improve the optimization strategy to push predictive performance further.
This research demonstrates a clear advance in using few-shot learning to refine PLMs, a step toward more efficient protein design that extracts maximal value from minimal experimental data. It paves the way for innovation in protein engineering, where data scarcity and computational cost remain formidable obstacles.