MeMDLM: De Novo Membrane Protein Design with Masked Discrete Diffusion Protein Language Models (2410.16735v1)
Abstract: Masked Diffusion Language Models (MDLMs) have recently emerged as a strong class of generative models, paralleling state-of-the-art (SOTA) autoregressive (AR) performance across natural language modeling domains. While there have been advances in AR as well as both latent and discrete diffusion-based approaches for protein sequence design, masked diffusion language modeling with protein language models (pLMs) remains unexplored. In this work, we introduce MeMDLM, an MDLM tailored for membrane protein design, harnessing the SOTA pLM ESM-2 to generate realistic membrane proteins de novo for downstream experimental applications. Our evaluations demonstrate that MeMDLM-generated proteins exceed AR-based methods by producing sequences with greater transmembrane (TM) character. We further apply our design framework to scaffold soluble and TM motifs in sequences, demonstrating that MeMDLM-reconstructed sequences achieve greater biological similarity to their original counterparts compared to SOTA inpainting methods. Finally, we show that MeMDLM captures physicochemical membrane protein properties with similar fidelity as SOTA pLMs, paving the way for experimental applications. In total, our pipeline motivates future exploration of MDLM-based pLMs for protein design.
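To make the generative setup concrete, the following is a minimal sketch of masked-diffusion-style iterative unmasking with a protein language model. It is not the authors' MeMDLM implementation: the checkpoint name, sequence length, number of denoising steps, and the confidence-based reveal schedule are all illustrative assumptions, using off-the-shelf ESM-2 from Hugging Face rather than a fine-tuned model.

```python
# Sketch only: iterative unmasking with off-the-shelf ESM-2, standing in for a
# trained masked discrete diffusion reverse process. All hyperparameters and the
# checkpoint name are assumptions, not the MeMDLM configuration.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

MODEL = "facebook/esm2_t33_650M_UR50D"  # assumed checkpoint; MeMDLM fine-tunes ESM-2
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = EsmForMaskedLM.from_pretrained(MODEL).eval()


def generate(seq_len: int = 100, num_steps: int = 20) -> str:
    """Start from an all-[MASK] sequence and reveal tokens over num_steps."""
    mask_id = tokenizer.mask_token_id
    # Build <cls> [MASK]*seq_len <eos>
    ids = torch.full((1, seq_len), mask_id, dtype=torch.long)
    ids = torch.cat([
        torch.tensor([[tokenizer.cls_token_id]]),
        ids,
        torch.tensor([[tokenizer.eos_token_id]]),
    ], dim=1)

    for step in range(num_steps):
        with torch.no_grad():
            logits = model(input_ids=ids).logits
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)

        masked = ids == mask_id
        if not masked.any():
            break
        # Unmask the most confident remaining positions this step
        # (a simple heuristic schedule in place of the learned reverse process).
        n_reveal = max(1, int(masked.sum().item() / (num_steps - step)))
        conf = conf.masked_fill(~masked, -1.0)
        top = conf[0].topk(n_reveal).indices
        ids[0, top] = pred[0, top]

    return tokenizer.decode(ids[0, 1:-1]).replace(" ", "")


print(generate())
```

The same loop also suggests how motif scaffolding can work in this framework: positions carrying a fixed soluble or TM motif are simply never masked, and only the surrounding scaffold positions are iteratively denoised.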