
MediSwift: Efficient Sparse Pre-trained Biomedical Language Models (2403.00952v2)

Published 1 Mar 2024 in cs.CL and cs.LG

Abstract: LLMs are typically trained on general source data for various domains, but a recent surge in domain-specific LLMs has shown their potential to outperform general-purpose models in domain-specific tasks (e.g., biomedicine). Although domain-specific pre-training enhances efficiency and leads to smaller models, the computational costs of training these LLMs remain high, posing budgeting challenges. We introduce MediSwift, a suite of biomedical LMs that leverage sparse pre-training on domain-specific biomedical text data. By inducing up to 75% weight sparsity during the pre-training phase, MediSwift achieves a 2-2.5x reduction in training FLOPs. Notably, all sparse pre-training was performed on the Cerebras CS-2 system, which is specifically designed to realize the acceleration benefits from unstructured weight sparsity, thereby significantly enhancing the efficiency of the MediSwift models. Through subsequent dense fine-tuning and strategic soft prompting, MediSwift models outperform existing LLMs up to 7B parameters on biomedical tasks, setting new benchmarks with respect to the efficiency-accuracy trade-off on tasks such as PubMedQA. Our results show that sparse pre-training, along with dense fine-tuning and soft prompting, offers an effective method for creating high-performing, computationally efficient models in specialized domains.
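To make the sparse-pre-training-then-dense-fine-tuning recipe concrete, the sketch below shows one common way to induce unstructured weight sparsity with a fixed magnitude-based mask that is re-applied after every optimizer step, and then dropped for dense fine-tuning. This is a minimal PyTorch illustration under assumptions of our own: the helper names (`apply_unstructured_sparsity`, `reapply_masks`), the toy model, and the one-shot magnitude criterion are hypothetical and do not reproduce the paper's Cerebras CS-2 kernels or its exact sparsity schedule.

```python
import torch
import torch.nn as nn

def apply_unstructured_sparsity(model: nn.Module, sparsity: float = 0.75):
    """Zero out the smallest-magnitude fraction of each Linear layer's weights
    and return boolean masks so the zeros can be re-applied after each update.
    (Illustrative one-shot magnitude pruning, not the paper's method.)"""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = int(sparsity * w.numel())
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            mask = w.abs() > threshold   # True = weight is kept
            w.mul_(mask)                 # induce sparsity in place
            masks[name] = mask
    return masks

def reapply_masks(model: nn.Module, masks: dict):
    """Keep pruned weights at zero throughout sparse pre-training."""
    for name, module in model.named_modules():
        if name in masks:
            module.weight.data.mul_(masks[name])

# Sparse pre-training loop (sketch): mask once, re-mask after every step.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
masks = apply_unstructured_sparsity(model, sparsity=0.75)

for step in range(10):                   # placeholder data and loss
    x = torch.randn(8, 512)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    reapply_masks(model, masks)          # weights stay ~75% sparse

# Dense fine-tuning phase: simply stop re-applying the masks so that all
# weights, including previously pruned ones, receive gradient updates again.
```

The FLOP savings claimed in the paper come from hardware that skips the zeroed weights (the CS-2's unstructured-sparsity support); on dense GPU kernels the masking above reduces effective parameters but not compute.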

Authors (6)
  1. Vithursan Thangarasa (13 papers)
  2. Mahmoud Salem (5 papers)
  3. Shreyas Saxena (9 papers)
  4. Kevin Leong (3 papers)
  5. Joel Hestness (23 papers)
  6. Sean Lie (7 papers)
Citations (1)