Efficiently Distilling LLMs for Edge Applications (2404.01353v1)
Published 1 Apr 2024 in cs.LG, cs.AI, and cs.CL
Abstract: Supernet training of LLMs is of great interest in industrial applications as it confers the ability to produce a palette of smaller models at constant cost, regardless of the number of models (of different sizes/latencies) produced. We propose a new method called Multistage Low-rank Fine-tuning of Super-transformers (MLFS) for parameter-efficient supernet training. We show that it is possible to obtain high-quality encoder models that are suitable for commercial edge applications, and that while decoder-only models are resistant to a comparable degree of compression, decoders can be effectively sliced for a significant reduction in training time.
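To make the two ideas in the abstract concrete, here is a minimal sketch of (1) low-rank (LoRA-style) fine-tuning of a shared supernet weight and (2) slicing that weight to extract a smaller subnetwork. This is an illustration under assumptions, not the paper's MLFS implementation: the class, method names, and slicing scheme below are hypothetical, and the actual multistage procedure, which subnetworks are sampled, and how slices are chosen are described in the paper itself.

```python
# Illustrative sketch only: a frozen "supernet" weight with a trainable low-rank
# update, plus naive width slicing to export a smaller subnetwork.
# Names, shapes, and the leading-rows slicing rule are assumptions, not MLFS.
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Frozen full weight W plus a trainable low-rank update B @ A."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        # Shared supernet weight: kept frozen during parameter-efficient training.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        # Trainable low-rank factors (the only parameters that receive gradients).
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def effective_weight(self) -> torch.Tensor:
        # Merge the low-rank update into the base weight.
        return self.weight + self.lora_B @ self.lora_A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.effective_weight().t()

    def slice(self, out_keep: int, in_keep: int) -> nn.Linear:
        """Export a smaller dense layer by keeping the leading rows/columns."""
        w = self.effective_weight()[:out_keep, :in_keep]
        sub = nn.Linear(in_keep, out_keep, bias=False)
        sub.weight.data.copy_(w)
        return sub


if __name__ == "__main__":
    layer = LowRankLinear(in_features=768, out_features=768, rank=8)
    x = torch.randn(4, 768)
    _ = layer(x)                                     # supernet forward pass
    small = layer.slice(out_keep=384, in_keep=768)   # one "sliced" subnetwork
    print(small.weight.shape)                        # torch.Size([384, 768])
```

The point of the sketch is the cost structure the abstract claims: the expensive base weights are trained (or frozen and adapted) once, and any number of smaller models can then be carved out of the same tensors, so producing more model sizes does not multiply training cost.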