
CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models (2505.22869v1)

Published 28 May 2025 in cs.CV, cs.LG, and q-bio.BM

Abstract: Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data available at https://github.com/yinjunbo/cfpgen.

Summary

  • The paper introduces CFP-Gen, a diffusion-based language model for combinatorial functional protein generation that integrates multimodal conditions (functional, sequence, structural) to overcome limitations of existing PLMs.
  • CFP-Gen utilizes Annotation-Guided Feature Modulation (AGFM) for dynamic functional control and Residue-Controlled Functional Encoding (RCFE) for capturing residue-wise interactions within the protein sequence.
  • Evaluations demonstrate CFP-Gen's effectiveness, achieving a 30% F1-score improvement for functional predictors and a 9% AAR improvement in inverse folding compared to benchmarks, enabling the design of novel proteins with desired functions.

Overview of CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

The paper introduces CFP-Gen, a diffusion-based language model designed for combinatorial functional protein generation. It addresses a significant limitation of existing protein language models (PLMs), which generate sequences primarily under a single-condition constraint from one modality. CFP-Gen overcomes this limitation by integrating conditions across functional, sequence, and structural modalities, enabling de novo design of proteins that satisfy several constraints at once.

Traditional PLMs are predominantly unconditional or conditioned on a single constraint, which limits their use in designing biologically meaningful proteins, where multiple functional constraints across diverse modalities must hold simultaneously. CFP-Gen handles these constraints in a single unified model built around two modules: Annotation-Guided Feature Modulation (AGFM) and Residue-Controlled Functional Encoding (RCFE). In addition, off-the-shelf 3D structure encoders can be plugged in to impose geometric constraints.

Methodological Innovations

  1. Annotation-Guided Feature Modulation (AGFM): This module dynamically adjusts the protein feature distribution based on composable functional annotations such as Gene Ontology (GO) terms, InterPro (IPR) domains, and Enzyme Commission (EC) numbers. AGFM modulates the normalized features according to whichever annotation tags are available, which gives a more fine-grained form of conditioning than classifier-guided diffusion frameworks. Because the module is trained jointly with the generator, functional annotations and sequence outputs stay aligned while functional constraints can still be combined flexibly.
  2. Residue-Controlled Functional Encoding (RCFE): RCFE captures residue-wise interactions, focusing on critical functional domains within protein sequences. Its transformer-based architecture encodes evolutionary relationships and epistasis among residues, which gives more precise control over the generated sequences, especially where specific sequence motifs are crucial.
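
To make the idea of annotation-guided modulation concrete, here is a minimal sketch in plain Python. It assumes a FiLM-style scale-and-shift applied to layer-normalized per-residue features, with the parameters of all active tags summed so that any subset of annotations composes. The class `AnnotationModulator`, its random parameter table, and the example tags are hypothetical illustrations; the summary does not give the paper's actual parameterization, so this should not be read as the authors' implementation.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class AnnotationModulator:
    """Toy annotation-conditioned scale/shift (hypothetical, not the authors' code)."""

    def __init__(self, vocab, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One small random (scale offset, shift) pair per annotation tag.
        # In a trained model these would be learned parameters.
        self.table = {
            tag: (
                [rng.gauss(0, 0.1) for _ in range(dim)],
                [rng.gauss(0, 0.1) for _ in range(dim)],
            )
            for tag in vocab
        }

    def condition(self, tags):
        """Sum per-tag parameters so that any subset of tags composes."""
        scale = [1.0] * self.dim  # identity modulation when no tag is given
        shift = [0.0] * self.dim
        for tag in tags:
            d_scale, d_shift = self.table[tag]
            scale = [s + d for s, d in zip(scale, d_scale)]
            shift = [b + d for b, d in zip(shift, d_shift)]
        return scale, shift

    def __call__(self, features, tags):
        """Modulate each residue's normalized feature vector."""
        scale, shift = self.condition(tags)
        return [
            [g * v + b for g, v, b in zip(scale, layer_norm(row), shift)]
            for row in features
        ]


if __name__ == "__main__":
    # Example tags are placeholders chosen for illustration only.
    vocab = ["GO:0003824", "IPR000719", "EC:2.7.11.1"]
    modulator = AnnotationModulator(vocab, dim=4)
    features = [[0.1, 0.4, -0.2, 0.9], [1.0, 0.0, 0.5, -0.5]]  # 2 residues
    unconditioned = modulator(features, [])
    conditioned = modulator(features, ["GO:0003824", "EC:2.7.11.1"])
    print(unconditioned[0])
    print(conditioned[0])
```

With an empty tag list the modulation reduces to plain layer normalization, which is one simple way to let the same model run unconditionally or under any combination of annotations.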

Evaluation and Results

CFP-Gen was evaluated on several protein design tasks: functional sequence generation, inverse folding, and multi-objective protein design. Results were quantified with the F1-score of leading function predictors and with Amino Acid Recovery (AAR), using ESM3 and DPLM as baselines. Notably, CFP-Gen achieved a 30% improvement in function-predictor F1-score over ESM3 and a 9% improvement in AAR over DPLM in inverse folding. These results support CFP-Gen's ability to generate novel proteins with functionality comparable to natural ones and its potential as a computational tool for complex biomedical applications.
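
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. It assumes AAR is the fraction of positions at which a designed sequence matches the native one, and it uses a micro-averaged F1 over per-protein sets of predicted annotation labels. The function names and toy inputs are illustrative assumptions, and the paper may use a different averaging scheme or evaluation protocol.

```python
def amino_acid_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def micro_f1(predicted, reference):
    """Micro-averaged F1 over per-protein sets of annotation labels."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        tp += len(pred & ref)  # labels predicted and present in the reference
        fp += len(pred - ref)  # labels predicted but absent from the reference
        fn += len(ref - pred)  # reference labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(amino_acid_recovery("MKTAYIAK", "MKTVYIAR"))  # 6 of 8 positions match
    print(micro_f1([{"a", "b"}, {"c"}], [{"a"}, {"c", "d"}]))
```

In the function-conditioned setting, the predicted label sets would come from running a function predictor on generated sequences, and the reference sets would be the annotations used as conditions.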

Practical and Theoretical Implications

The introduction of CFP-Gen marks a meaningful shift in how proteins can be computationally designed to meet diverse functional requirements. Practically, the model holds promise for applications in drug development, enzyme engineering, and therapeutic protein design. Theoretically, CFP-Gen demonstrates that a diffusion language model, enriched with multimodal conditions, can perform multi-objective optimization in protein generation. This could shift the focus of protein design from iterative, condition-specific processes to integrated frameworks that consider the interconnectedness of biological functions, sequences, and structures.

Speculative Outlook

The methodological advancements in CFP-Gen pave the way for future developments in AI-driven protein design. Given the results, one could anticipate further research into expanding the model's conditional vocabulary, improving its handling of rare protein classes, and fine-tuning it for tissue-specific or disease-specific therapeutic applications. Additionally, the architecture and principles of CFP-Gen could inform other combinatorial generation tasks beyond protein science.

In summary, CFP-Gen integrates multimodal, annotation-driven conditioning into a diffusion language model and sets a strong reference point for functional protein generation. Its contributions advance precise control in protein engineering, making it a useful tool for researchers applying computational methods in biotechnology and synthetic biology.
