Generalization of protein language models beyond natural sequences
Determine whether specialized protein-based large language models (LLMs) trained on natural protein sequences can generalize to designing functional proteins whose sequences fall outside the natural sequence distribution (i.e., "unnatural" sequences).
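The source does not define how far a sequence must be from natural proteins to count as "unnatural". One possible operationalization, offered only as a minimal sketch, is to score a generated sequence by its highest identity to any sequence in a natural reference set, where identity is taken as one minus the length-normalized edit distance. The function names, the identity definition, and the idea of thresholding this score are all assumptions made for illustration, not something taken from the cited paper.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def nearest_natural_identity(candidate: str, natural_set: Iterable[str]) -> float:
    """Highest identity (1 - normalized edit distance) to any reference sequence.

    Hypothetical helper: a low value suggests the candidate lies far from
    the reference set, which is one rough proxy for "unnatural".
    """
    best = 0.0
    for ref in natural_set:
        longest = max(len(candidate), len(ref))
        if longest == 0:
            continue  # skip the degenerate empty-vs-empty comparison
        identity = 1.0 - edit_distance(candidate, ref) / longest
        best = max(best, identity)
    return best
```

In practice the reference set would be a large natural-sequence database and the comparison would use a proper alignment tool rather than this quadratic-time loop. A low nearest-neighbor identity says only that a sequence is distant from known proteins; whether such a sequence is functional is the open question and still requires experimental validation.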
References
Specialized protein-based LLMs above have shown considerable promise; however, despite evidence that they can design protein sequences that align with user requirements, it remains to be seen whether these LLMs, trained on natural protein sequences, can generalize beyond these to unnatural sequences.
— Large Language Models in Drug Discovery and Development: From Disease Mechanisms to Clinical Trials
(2409.04481 - Zheng et al., 2024) in Section 3.2.2, In-silico Simulation — De novo Protein Generation