Guide your favorite protein sequence generative model (2505.04823v2)

Published 7 May 2025 in cs.LG and q-bio.BM

Abstract: Generative machine learning models on sequences are transforming protein engineering. However, no principled framework exists for conditioning these models on auxiliary information, such as experimental data, in a plug-and-play manner. Herein, we present ProteinGuide -- a principled and general method for conditioning -- by unifying a broad class of protein generative models under a single framework. We demonstrate the applicability of ProteinGuide by guiding two protein generative models, ProteinMPNN and ESM3, to generate amino acid and structure token sequences, conditioned on several user-specified properties such as enhanced stability, enzyme classes, and CATH-labeled folds. We also used ProteinGuide with inverse folding models and our own experimental assay to design adenine base editor sequences for high activity.

Summary

Overview of "Guide your favorite protein sequence generative model"

The paper introduces ProteinGuide, a framework for conditioning protein sequence generative models on user-specified properties without retraining them. The authors argue that although generative machine learning models have significantly advanced protein engineering, no principled, plug-and-play way of guiding these models with auxiliary information, such as experimental data or classifier outputs, previously existed.

Key Contributions

  1. Unified Framework for Generative Models: ProteinGuide unifies several classes of protein generative models, including masked language models (MLMs), order-agnostic autoregressive models (OA-AR), diffusion models, and flow-matching models, under a single framework. As a result, any of these models can be conditioned on user-specified properties without changing the parameters of the pre-trained model.
  2. Plug-and-Play Guidance without Retraining: ProteinGuide is designed to be plug-and-play. A pre-trained generative model can be conditioned on a user-defined property without time-intensive retraining, so generation can be steered toward desired properties using external datasets or classifiers.
  3. Application to Existing Models: The paper demonstrates ProteinGuide on two widely used generative models, ProteinMPNN and ESM3. Under guidance, these models generate amino acid sequences and structure token sequences conditioned on properties such as enhanced stability, enzyme class, and CATH-labeled fold. The authors also used ProteinGuide with inverse folding models and their own experimental assay to design adenine base editor sequences for high activity. In each case the existing model is adapted without retraining.
  4. Technical Innovations in Guidance for Discrete Spaces: Building on prior work on guidance for diffusion and flow models on discrete state spaces, ProteinGuide extends guidance to additional model classes by exploiting equivalences between their training losses. Guidance is implemented by modifying the transition rates of the generative process, so that samples are steered toward the desired property in a statistically principled way.
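
The rate modification in item 4 can be illustrated with a minimal Python sketch. It assumes the form of discrete guidance in which the rate of a jump from state x to state x' is multiplied by the predictor ratio p(y | x') / p(y | x). For a single unmasking step, normalizing over candidate tokens cancels the denominator, so each token's probability under the frozen model is reweighted by the predictor's probability of the target property. Everything named here (ALPHABET, toy_denoiser, toy_predictor, guided_token_probs, guided_sample) is hypothetical and stands in for a real pretrained model and a trained predictor; it is not the authors' implementation. As a further simplification, the sketch fixes a random unmasking order rather than letting guidance influence which position is unmasked next.

```python
import math
import random

ALPHABET = "ACDE"  # toy 4-letter alphabet (assumption, not the 20 amino acids)
MASK = "_"


def toy_denoiser(seq, pos):
    """Stand-in for a frozen pretrained model: p(token at pos | context).

    Hypothetical: returns a uniform distribution and ignores the context.
    """
    return {a: 1.0 / len(ALPHABET) for a in ALPHABET}


def toy_predictor(seq):
    """Stand-in for a property predictor p(y = 1 | partially masked seq).

    Hypothetical: logistic in the number of 'A' tokens; masked positions
    contribute nothing to the score.
    """
    score = sum(1.0 for c in seq if c == "A") - 1.0
    return 1.0 / (1.0 + math.exp(-score))


def guided_token_probs(seq, pos, denoiser, predictor):
    """Reweight the model's token distribution at `pos` by the predictor.

    Computes p(a | context, y) proportional to p(a | context) * p(y | seq with a at pos).
    The denoiser's parameters are never touched.
    """
    prior = denoiser(seq, pos)
    weights = {}
    for token, p in prior.items():
        candidate = seq[:pos] + [token] + seq[pos + 1:]
        weights[token] = p * predictor(candidate)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def guided_sample(length, denoiser, predictor, rng):
    """Unmask one position at a time in a random order, using guided probabilities."""
    seq = [MASK] * length
    order = list(range(length))
    rng.shuffle(order)
    for pos in order:
        probs = guided_token_probs(seq, pos, denoiser, predictor)
        tokens = list(probs)
        seq[pos] = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    print(guided_sample(8, toy_denoiser, toy_predictor, rng))
```

With these toy stand-ins, guided samples tend to be enriched in 'A' relative to uniform sampling, while toy_denoiser itself is never updated. Swapping in a different predictor changes the conditioning target without touching the generative model, which is the plug-and-play property described in item 2.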

Results

The authors report numerical results for guided generation with both models:

  • Enhanced Stability and Folding in ProteinMPNN: With stability-guided generation, ProteinMPNN produces a statistically significant increase in sequences that both match the desired structural fold and are more stable than the baseline model's sequences.
  • Fold Class-Specific Structure Generation with ESM3: Similarly, ESM3 guided by ProteinGuide yields structure tokens consistent with the specified CATH fold classes, supporting the method's efficacy across multiple levels of the fold hierarchy.

Implications

The introduction of ProteinGuide presents several implications for protein engineering and generative modeling:

  • Efficiency in Design: Because conditioning is applied at sampling time without retraining, practitioners can iterate over sequence designs more rapidly, integrating experimental feedback as it becomes available.
  • Versatility of Framework: The ability to apply this framework across various model architectures allows for widespread adoption in diverse biological and chemical modeling tasks.
  • Potential for Broader AI Developments: Beyond protein modeling, this work adds to the growing body of methods for steering pre-trained generative models with external feedback, without modifying the underlying model or requiring large new training sets.

Future Directions

The paper suggests avenues for future exploration, such as extending the guidance methodology to even broader classes of models and improving the statistical efficiency of predictor-guided generation. Further experiments that incorporate live experimental feedback could also refine ProteinGuide's practical applications.