MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining (2312.17482v2)

Published 29 Dec 2023 in cs.CL and cs.LG

Abstract: Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models. We open source our model weights and code.

Authors (9)
  1. Jacob Portes (6 papers)
  2. Alex Trott (3 papers)
  3. Sam Havens (6 papers)
  4. Daniel King (18 papers)
  5. Abhinav Venigalla (5 papers)
  6. Moin Nadeem (8 papers)
  7. Nikhil Sardana (5 papers)
  8. Daya Khudia (6 papers)
  9. Jonathan Frankle (37 papers)

Summary

Overview

MosaicBERT is a BERT-style encoder architecture and training recipe whose central goal is to maximize pretraining speed while preserving downstream accuracy. BERT-style encoders remain workhorses of NLP research, yet few practitioners pretrain them from scratch because of the cost. MosaicBERT addresses this by folding modern transformer components and efficient training techniques into the classic encoder block, yielding a fast and cost-effective option for researchers and engineers.

Architectural Enhancements

At the heart of MosaicBERT are several architectural changes designed to accelerate pretraining. These include FlashAttention, which reduces memory traffic in the attention computation, and Attention with Linear Biases (ALiBi), which encodes positional information through fixed per-head biases instead of learned position embeddings. In addition, Gated Linear Units (GLUs) in the feedforward layers, low precision LayerNorm, and a dynamic unpadding mechanism that skips computation on padding tokens all contribute to pretraining efficiency; a short sketch of the ALiBi bias and the gated feedforward block follows.
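To make two of these components concrete, the PyTorch sketch below builds a symmetric ALiBi bias (distance-based, as appropriate for a bidirectional encoder) and a GLU-style feedforward block. This is an illustrative sketch under stated assumptions, not the authors' implementation: the head count, hidden sizes, GELU gate, and slope schedule are choices based on the ALiBi and GLU papers rather than on MosaicBERT's released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric ALiBi bias for a bidirectional encoder: each head penalizes
    attention scores in proportion to the token distance |i - j|."""
    # Geometric slope schedule from the ALiBi paper: 2^(-8/H), 2^(-16/H), ..., 2^(-8).
    slopes = torch.tensor([2.0 ** (-8.0 * (k + 1) / num_heads) for k in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # (L, L)
    return -slopes[:, None, None] * distance[None, :, :]         # (H, L, L)

class GatedFeedForward(nn.Module):
    """GLU-style feedforward block (GELU gate) in place of the standard MLP."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden)
        self.up_proj = nn.Linear(d_model, d_hidden)
        self.down_proj = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))

# Usage: the ALiBi bias is simply added to the attention scores before softmax,
# so no position embeddings are learned or stored.
heads, seq_len, d_model = 12, 128, 768                           # placeholder sizes
bias = alibi_bias(heads, seq_len)                                # (12, 128, 128)
scores = torch.randn(2, heads, seq_len, seq_len) / math.sqrt(d_model // heads)
attn = torch.softmax(scores + bias, dim=-1)
ffn_out = GatedFeedForward(d_model, 2048)(torch.randn(2, seq_len, d_model))
```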

Pretraining Acceleration

One of the notable results is that MosaicBERT reaches strong downstream performance on the GLUE (General Language Understanding Evaluation) benchmark with minimal resources. MosaicBERT-Base achieves an average GLUE (dev) score of 79.6 in only 1.13 hours on eight A100 80 GB GPUs, at a cost of roughly $20. This reduction in training time and expense makes it practical to pretrain custom BERT-style models tailored to specific domains, without the prohibitive costs usually associated with such efforts.
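As a quick sanity check on that figure, the dollar cost is simply GPU-hours multiplied by an hourly rate. The rate below is an assumed on-demand A100 price used for illustration; it is not a number from the paper.

```python
# Back-of-the-envelope cost check (the hourly rate is an assumption, not from the paper).
num_gpus = 8
wall_clock_hours = 1.13
assumed_price_per_gpu_hour = 2.2   # USD, rough on-demand A100 80GB rate

gpu_hours = num_gpus * wall_clock_hours            # ~9.0 GPU-hours
cost = gpu_hours * assumed_price_per_gpu_hour      # ~$20
print(f"{gpu_hours:.1f} GPU-hours, ~${cost:.0f}")
```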

Optimal Performance

MosaicBERT's architecture and pretraining recipe are shown empirically to be efficient: the model reaches high scores on language understanding benchmarks quickly, and the accuracy-versus-training-time trade-off is characterized through Pareto curves. MosaicBERT-Base and MosaicBERT-Large are systematically compared with competitive BERT baselines, and both are Pareto optimal, meaning that no baseline configuration on the curve is simultaneously faster to train and more accurate; a small sketch of how such a frontier is computed follows.
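To make "Pareto optimal" concrete, the sketch below extracts the accuracy-vs-training-time frontier from a set of runs: a run stays on the frontier only if no other run trains in less time and scores at least as well. The first pair echoes the MosaicBERT-Base number quoted above; the remaining pairs are hypothetical placeholders, not results from the paper.

```python
def pareto_frontier(runs):
    """Return the non-dominated runs, i.e. those for which no other run is both
    faster (fewer training hours) and at least as accurate. `runs` is a list of
    (training_hours, glue_score) tuples; assumes distinct training times."""
    frontier = []
    for hours, score in sorted(runs):                 # scan in order of training time
        if not frontier or score > frontier[-1][1]:   # keep only strict accuracy gains
            frontier.append((hours, score))
    return frontier

# First point echoes MosaicBERT-Base (1.13 h, 79.6 GLUE); the rest are made up.
runs = [(1.13, 79.6), (2.0, 79.1), (3.5, 82.0), (5.0, 81.5)]
print(pareto_frontier(runs))   # [(1.13, 79.6), (3.5, 82.0)]
```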

Conclusion and Contributions

MosaicBERT combines established and recent architectural features with an optimized training recipe to deliver an efficient and effective encoder model. It offers a practical route to pretraining custom BERT-style models quickly and economically, shifting the default workflow away from finetuning generic models and toward domain-specific pretraining. The authors have open-sourced their model weights and code, making these results straightforward to reproduce and build on.