Generalization of protein language models beyond natural sequences

Determine whether specialized protein large language models trained on natural protein sequences can generalize to generate or design functional proteins whose sequences fall outside the natural sequence distribution (i.e., "unnatural" sequences).

Background

In its discussion of de novo protein generation, the paper reviews specialized protein LLMs that are trained on large corpora of natural protein sequences and have shown promising results in both unconstrained and constrained sequence design: some models generate novel sequences freely, while others condition generation on desired structural or functional constraints.

Despite these successes, the authors explicitly flag uncertainty about whether such models can truly generalize beyond the distribution of natural sequences to produce functional "unnatural" proteins. They note subsequent work showing experimental successes, but raise the broader question of generalization as a key unresolved issue for protein design with LLMs.
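To make the generalization question concrete, the sketch below uses a toy positional-frequency sampler over made-up "natural" sequences (all data and helper names are hypothetical, not the paper's method) and scores each generated sequence by its maximum identity to the training set; low-identity samples are the "unnatural" regime whose functionality is the open question.

```python
import random

# Hypothetical toy "natural" training set (not real protein sequences).
natural = ["MKTAYIAKQR", "MKTAYLAKQG", "MKSAYIAKQR", "MKTVYIAKQR"]

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def max_identity(seq, pool):
    """Highest per-position identity of seq against any sequence in pool."""
    return max(identity(seq, p) for p in pool)

def sample_positionwise(pool, rng):
    """Sample a sequence column-by-column from the pool's per-position
    residue frequencies -- a crude stand-in for a trained protein LM."""
    return "".join(rng.choice([p[i] for p in pool]) for i in range(len(pool[0])))

rng = random.Random(0)
samples = [sample_positionwise(natural, rng) for _ in range(5)]
for s in samples:
    # Samples with low max identity fall outside the "natural" set;
    # whether such sequences remain functional is the unresolved issue.
    print(s, round(max_identity(s, natural), 2))
```

In practice this identity check would be replaced by sequence-database search (e.g., BLAST) and, ultimately, by experimental validation of the designed proteins.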

References

These specialized protein-based LLMs have shown considerable promise; however, despite evidence that they can design protein sequences that align with user requirements, it remains to be seen whether LLMs trained on natural protein sequences can generalize beyond them to unnatural sequences.

Large Language Models in Drug Discovery and Development: From Disease Mechanisms to Clinical Trials (Zheng et al., 2024, arXiv:2409.04481), Section 3.2.2, In-silico Simulation — De novo Protein Generation