Training Compute-Optimal Protein Language Models (2411.02142v1)
Abstract: We explore compute-optimal training of protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute until performance gains plateau, with effort focused primarily on increasing model size rather than on the efficient compute frontier that balances performance against compute budget. Our investigation is grounded in a massive dataset of 939 million protein sequences. We trained over 300 models, ranging from 3.5 million to 10.7 billion parameters, on 5 to 200 billion unique tokens to investigate the relationships between model size, number of training tokens, and training objective. First, we observed diminishing returns for the Causal Language Model (CLM) and overfitting for the Masked Language Model (MLM) when repeating the commonly used UniRef database; to address this, we added metagenomic protein sequences to the training set to increase diversity and avoid the plateau and overfitting effects. Second, we obtained scaling laws for CLM and MLM on the Transformer architecture, tailored to the specific characteristics of protein sequence data. Third, we observed a transfer-scaling phenomenon from CLM to MLM, further demonstrating the effectiveness of transfer through scaling behavior based on estimated Effectively Transferred Tokens. Finally, to validate our scaling laws, we compared against the large-scale versions of ESM-2 and PROGEN2 on downstream tasks, encompassing protein generation as well as structure- and function-related evaluations, all within equal or lower pre-training compute budgets.
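To make the scaling-law fitting concrete, the sketch below fits a Chinchilla-style parametric loss surface, L(N, D) = E + A/N^alpha + B/D^beta, to a handful of (parameter count, token count, final loss) points, the general recipe behind compute-optimal frontiers of this kind (Hoffmann et al., 2022, cited below). It is a minimal illustration only: the functional form is the standard one from that line of work, and the synthetic data points and fitted coefficients are placeholders for exposition, not the objective-specific CLM/MLM laws reported in this paper.

```python
# Minimal sketch: fit a Chinchilla-style parametric scaling law
#   L(N, D) = E + A / N**alpha + B / D**beta
# to per-run final losses (N = parameters, D = unique training tokens).
# All numbers are synthetic placeholders, not measurements from the paper.
import numpy as np
from scipy.optimize import curve_fit

def parametric_loss(X, E, A, alpha, B, beta):
    """Predicted loss for a model with N parameters trained on D tokens."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic "training runs" spanning roughly the paper's ranges
# (3.5M-10.7B parameters, 5B-200B tokens), with losses generated from an
# arbitrary ground-truth law plus a little noise.
rng = np.random.default_rng(0)
N = np.array([3.5e6, 3.5e7, 1.0e8, 1.0e9, 3.0e9, 1.07e10])
D = np.array([5.0e9, 1.0e10, 2.0e10, 5.0e10, 1.0e11, 2.0e11])
true = dict(E=1.7, A=80.0, alpha=0.34, B=200.0, beta=0.28)  # illustrative only
L = parametric_loss((N, D), **true) + rng.normal(0.0, 0.01, N.size)

# Fit the five coefficients; non-negativity bounds keep the fit sensible.
popt, _ = curve_fit(
    parametric_loss, (N, D), L,
    p0=[1.5, 100.0, 0.3, 100.0, 0.3],
    bounds=(0.0, np.inf),
    max_nfev=20_000,
)
E, A, alpha, B, beta = popt
print(f"L(N, D) ≈ {E:.2f} + {A:.1f}/N^{alpha:.2f} + {B:.1f}/D^{beta:.2f}")
```

Given such a fit, the compute-optimal allocation for a budget C follows by minimizing L(N, D) under an approximate cost model such as C ≈ 6ND, yielding power-law prescriptions for model size and token count; the paper derives objective-specific versions of such laws for CLM and MLM pre-training on protein sequences.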
References:
- BFD: Big Fantastic Database. https://bfd.mmseqs.com.
- Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pages 265–279. PMLR, 2023.
- Eukaryotic genomes from a global metagenomic dataset illuminate trophic modes and biogeography of ocean plankton. bioRxiv, 2021.
- Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
- Proteinbert: a universal deep-learning model of protein sequence and function. Bioinformatics, 38(8):2102–2110, 2022.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Massive expansion of human gut bacteriophage diversity. Cell, 184(4):1098–1109, 2021.
- xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein. arXiv preprint arXiv:2401.06199, 2024.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- Unified scaling laws for routed language models. In International Conference on Machine Learning, pages 4057–4086. PMLR, 2022.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.
- Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
- Language modeling is compression. arXiv preprint arXiv:2309.10668, 2023.
- Functional repertoire convergence of distantly related eukaryotic plankton lineages abundant in the sunlit ocean. Cell Genomics, 2(5):100123, 2022.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022.
- Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
- Ankh: Optimized protein language model unlocks general-purpose modelling. arXiv preprint arXiv:2301.06568, 2023.
- Prottrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127, 2021.
- European Bioinformatics Institute. Jackhmmer tool. EBI Tools Documentation, n.d.
- fast.ai. How could the memorization hypothesis be true. fast.ai Blog, 2023. Retrieved May 21, 2024, from https://www.fast.ai/posts/2023-09-04-learning-jumps/#how-could-the-memorization-hypothesis-be-true.
- Protgpt2 is a deep unsupervised language model for protein design. Nature communications, 13(1):4348, 2022.
- Prostt5: Bilingual language model for protein sequence and structure. bioRxiv, 2023.
- Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
- Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487, 2022.
- Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Hugging Face. Llama 2 model documentation, n.d.
- Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.
- Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Aran Komatsuzaki. One epoch is all you need. arXiv preprint arXiv:1906.06669, 2019.
- Metaeuk—sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome, 8:1–15, 2020.
- Feature reuse and scaling: Understanding transfer learning with protein language models. bioRxiv, 2024.
- Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv, 2022.
- Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
- Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
- Scaling laws of rope-based extrapolation. arXiv preprint arXiv:2310.05209, 2023.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Progen: Language modeling for protein generation. arXiv preprint arXiv:2004.03497, 2020.
- The integrated microbial genomes (IMG) system. Nucleic acids research, 34(suppl 1):D344–D348, 2006.
- An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.
- Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 34:29287–29303, 2021.
- Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. arXiv preprint arXiv:2010.09697, 2020.
- Colabfold: making protein folding accessible to all. Nature methods, 19(6):679–682, 2022.
- Mgnify: the microbiome analysis resource in 2020. Nucleic acids research, 48(D1):D570–D578, 2020.
- Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024.
- Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nature microbiology, 6(7):960–970, 2021.
- Sequence modeling and design from molecular to genome scale with Evo. bioRxiv, 2024.
- Progen2: exploring the boundaries of protein language models. Cell systems, 14(11):968–978, 2023.
- Proteingym: large-scale benchmarks for protein fitness prediction and design. Advances in Neural Information Processing Systems, 36, 2024.
- Instructplm: Aligning protein language models to follow protein structure instructions. bioRxiv, 2024.
- Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
- Hhblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature methods, 9(2):173–175, 2012.
- Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.
- Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021.
- Unconstrained generation of synthetic antibody–antigen structures to guide machine learning methodology for antibody specificity prediction. Nature Computational Science, 2(12):845–865, 2022.
- Repeat or not repeat?—statistical validation of tandem repeat prediction in genomic sequences. Nucleic acids research, 40(20):10005–10017, 2012.
- Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11):1026–1028, 2017.
- Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932, 2015.
- Scale efficiently: Insights from pre-training and fine-tuning transformers. arXiv preprint arXiv:2109.10686, 2021.
- Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022.
- Transcending scaling laws with 0.1% extra compute. arXiv preprint arXiv:2210.11399, 2022.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Foldseek: Fast and accurate protein structure search. bioRxiv, 2022.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Language models generalize beyond natural proteins. bioRxiv, 2022.
- Bert has a mouth, and it must speak: Bert as a markov random field language model. arXiv preprint arXiv:1902.04094, 2019.
- Deepnet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, pages 22964–22984. PMLR, 2022.
- Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
- Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.
- When scaling meets llm finetuning: The effect of data, model and finetuning method. arXiv preprint arXiv:2402.17193, 2024.
- Structure-informed language models are protein designers. In International Conference on Machine Learning, pages 42317–42338. PMLR, 2023.