MediSwift: Efficient Sparse Pre-trained Biomedical Language Models (2403.00952v2)
Abstract: LLMs are typically trained on general-domain source data spanning many fields, but a recent surge in domain-specific LLMs has shown that they can outperform general-purpose models on domain-specific tasks (e.g., in biomedicine). Although domain-specific pre-training improves efficiency and enables smaller models, the computational cost of training these LLMs remains high, posing budgeting challenges. We introduce MediSwift, a suite of biomedical language models that leverage sparse pre-training on domain-specific biomedical text. By inducing up to 75% weight sparsity during pre-training, MediSwift achieves a 2-2.5x reduction in training FLOPs. Notably, all sparse pre-training was performed on the Cerebras CS-2 system, which is designed to realize the acceleration benefits of unstructured weight sparsity, significantly enhancing the efficiency of the MediSwift models. Through subsequent dense fine-tuning and strategic soft prompting, MediSwift models outperform existing LLMs of up to 7B parameters on biomedical tasks, setting new benchmarks on the efficiency-accuracy trade-off for tasks such as PubMedQA. Our results show that sparse pre-training, combined with dense fine-tuning and soft prompting, is an effective way to build high-performing, computationally efficient models in specialized domains.
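The recipe behind these numbers, masking most weights during pre-training and then releasing the mask for dense fine-tuning, can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the MediSwift implementation: the `SparseLinear` module, the static random mask, and the `densify` helper are hypothetical names introduced here, and the actual training runs on the Cerebras CS-2 software stack, which skips the zeroed multiply-accumulates in hardware (a plain GPU matmul over a masked weight matrix does not save FLOPs by itself). Since a 75%-sparse matrix multiply needs only a quarter of the multiply-accumulates, the per-layer saving is bounded by 4x; the 2-2.5x figure above is the end-to-end reduction.

```python
# Illustrative sketch of sparse pre-training followed by dense fine-tuning.
# SparseLinear, the static random mask, and densify() are assumptions for
# exposition; they are not MediSwift's actual implementation.
import torch
import torch.nn as nn

class SparseLinear(nn.Linear):
    """Linear layer whose weights are masked to a fixed unstructured sparsity."""

    def __init__(self, in_features, out_features, sparsity=0.75):
        super().__init__(in_features, out_features)
        # Randomly zero out a `sparsity` fraction of weights; keep the mask fixed.
        mask = (torch.rand_like(self.weight) >= sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Only unmasked weights participate. Hardware that exploits unstructured
        # sparsity can skip the masked multiply-accumulates, which is where the
        # training-FLOP reduction comes from.
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

def densify(model):
    """Switch to dense fine-tuning: bake the mask in once, then train all weights."""
    for module in model.modules():
        if isinstance(module, SparseLinear):
            with torch.no_grad():
                module.weight.mul_(module.mask)
            module.mask.fill_(1.0)  # subsequent updates are fully dense
    return model

# Usage: pre-train with masked layers, then call densify(model) before fine-tuning.
layer = SparseLinear(1024, 1024, sparsity=0.75)  # ~25% of weights active
y = layer(torch.randn(2, 1024))
```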
Authors: Vithursan Thangarasa, Mahmoud Salem, Shreyas Saxena, Kevin Leong, Joel Hestness, Sean Lie