
Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling (2301.06568v1)

Published 16 Jan 2023 in cs.LG, cs.CL, cs.DC, and q-bio.QM

Abstract: As opposed to scaling-up protein language models (PLMs), we seek improving performance via protein-specific optimization. Although the proportionality between the language model size and the richness of its learned representations is validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments ranging from masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that interprets the language of life, optimally. We present Ankh, the first general-purpose PLM trained on Google's TPU-v4 surpassing the state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks where Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.

Citations (66)

Summary

  • The paper introduces a domain-specific optimization that enhances protein language modeling without scaling up model size.
  • It employs varied masking strategies, architecture tweaks, and efficient pre-training data usage to reduce parameters and boost performance.
  • The model shows versatile success in downstream tasks, offering faster feature extraction and lower computational requirements compared to larger models.

An Overview of Ankh: Optimized Protein Language Model

The paper presents a novel approach to protein language modeling with the introduction of Ankh, an optimized protein language model (PLM). This work aims to improve PLMs through data-efficient, protein-specific optimization rather than the typical route of scaling up model size. The authors support this direction with over twenty experiments spanning masking strategies, architecture, and pre-training data, which ultimately lead to Ankh. The intention is to explore how domain-specific adjustments can produce a more accessible and computationally efficient model, achieving high performance with reduced computational resources.

Ankh demonstrates superior performance over existing PLMs, such as ProtTrans's ProtT5-XL-U50 and the ESM models, with fewer parameters during both pre-training and inference. The model excels across a representative range of structure and function benchmarks. Notably, it improves average task performance by 4.8% while using less than 10% of the training parameters and less than 30% of the embedding dimension of current state-of-the-art models. This is achieved while maintaining computational efficiency and remaining accessible through affordable resources.

The results show that Ankh supports sequence lengths of up to 1024 amino acids, with a notable reduction in feature extraction time compared to larger models such as the 15B-parameter ESM-2. Ankh's computational demands are also lower, requiring less hardware for equivalent analyses, which presents a compelling argument for the shift towards smaller, optimized models driven by domain knowledge. In downstream tasks such as secondary structure prediction, contact prediction, fold prediction, and protein function prediction, Ankh consistently outperforms its counterparts.
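To make the feature-extraction workflow concrete, the sketch below shows how per-residue embeddings could be pulled from an Ankh-style encoder using the Hugging Face `transformers` library. The checkpoint identifier, the tokenizer behaviour, and the example sequence are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of per-residue embedding extraction from an Ankh-style
# T5 encoder. The hub identifier "ElnaggarLab/ankh-large" and the exact
# tokenization details are assumptions, not confirmed by the paper.
import torch
from transformers import AutoTokenizer, T5EncoderModel

model_name = "ElnaggarLab/ankh-large"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5EncoderModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One embedding vector per token; averaging gives a whole-protein vector.
residue_embeddings = outputs.last_hidden_state.squeeze(0)
protein_embedding = residue_embeddings.mean(dim=0)
print(residue_embeddings.shape, protein_embedding.shape)
```

The per-residue embeddings feed residue-level tasks (e.g. secondary structure), while the pooled vector suits whole-protein tasks such as fold or function prediction.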

For protein variant generation, Ankh employs both auto-regressive fine-tuning and masked language modeling (MLM) frameworks to accommodate family-based (High-N) and single-sequence (One-N) generation tasks, affirming its versatility. This setup allows exploration-exploitation trade-offs to be tuned, and the model closely mimics natural sequence distributions even with minimal sample representation, suggesting robust generalization capabilities.
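As a rough, model-agnostic illustration of the MLM-style generation idea, the sketch below masks a fraction of positions and samples replacements from a per-position distribution. The helper `predict_masked_logits`, the masking rate, and the temperature are hypothetical stand-ins, not the paper's actual procedure.

```python
# Hedged sketch of MLM-style variant generation: mask a fraction of
# residues, then sample replacements from the model's per-position
# distribution. `predict_masked_logits` is a hypothetical helper that
# stands in for a forward pass through a masked protein language model.
import random
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_variant(sequence, predict_masked_logits, mask_rate=0.15, temperature=1.0):
    """Return one sampled variant of `sequence`."""
    n_masked = max(1, int(mask_rate * len(sequence)))
    positions = random.sample(range(len(sequence)), n_masked)
    variant = list(sequence)
    for pos in positions:
        # Hypothetical call: logits over the 20 amino acids at `pos`,
        # given the sequence with that position masked.
        logits = np.asarray(predict_masked_logits(sequence, pos), dtype=float)
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()
        variant[pos] = np.random.choice(list(AMINO_ACIDS), p=probs)
    return "".join(variant)
```

Lower temperatures exploit the model's learned conservation signal (fewer, more conservative substitutions), while higher temperatures explore more diverse variants, which is one simple way to realize the exploration-exploitation trade-off described above.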

The authors prioritize knowledge-guided optimization, emphasizing the value of tailoring models to protein-specific needs, such as adopting relative positional embeddings and employing a Gated-GELU activation function. Such innovations facilitate improved performance without extensive increases in model or data size, advocating for the merits of intelligent model and software engineering approaches.
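For readers less familiar with the Gated-GELU choice, the sketch below shows a T5-style feed-forward block with a gated GELU (GEGLU) activation. The dimensions, dropout rate, and bias-free projections follow common T5 conventions and are assumptions here, not the paper's exact configuration.

```python
# Hedged sketch of a feed-forward block with a Gated-GELU (GEGLU)
# activation, one of the architectural choices highlighted above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGELUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.wo = nn.Linear(d_ff, d_model, bias=False)    # output projection
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU-activated gate multiplied element-wise with a linear path.
        hidden = F.gelu(self.wi_0(x)) * self.wi_1(x)
        return self.wo(self.dropout(hidden))

# Illustrative sizes: batch of 2 sequences, 128 residues, model width 768.
x = torch.randn(2, 128, 768)
ffn = GatedGELUFeedForward(d_model=768, d_ff=3072)
print(ffn(x).shape)  # torch.Size([2, 128, 768])
```

The gating lets the block modulate which features pass through, which is often credited with better parameter efficiency than a plain ReLU or GELU feed-forward layer.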

Overall, the paper suggests a paradigm shift in protein language modeling, underscoring the potential benefits of domain-specific optimizations over the prevalent trend of expanding model scale. This nuanced understanding and application could stimulate future work on specialized models for intricate protein-related tasks, and may influence other domains that rely on language models.

Ankh's successful demonstration of efficient and accessible protein modeling sets a benchmark for future developments in AI-driven biological research, pointing towards an era where practical, knowledge-rich optimizations might unlock potential often overshadowed by simple scaling efforts. This work thus highlights a critical dialogue about balancing model performance with computational feasibility in protein language modeling.
