Head-wise Shareable Attention for Large Language Models (2402.11819v3)
Abstract: LLMs suffer from a huge number of parameters, which restricts their deployment on edge devices. Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with little performance drop. However, current weight sharing techniques primarily focus on small-scale models like BERT and employ coarse-grained sharing rules, e.g., layer-wise sharing. This becomes limiting given the prevalence of LLMs, and sharing an entire layer or block clearly diminishes the flexibility of weight sharing. In this paper, we present a perspective on head-wise shareable attention for LLMs. We further propose two memory-efficient methods that share parameters across attention heads, with a specific focus on LLMs. Both use the same dynamic strategy to select the shared weight matrices. The first method, denoted $\textbf{DirectShare}$, directly reuses the pre-trained weights without retraining. The second method, denoted $\textbf{PostShare}$, first post-trains with a constraint on weight matrix similarity and then shares. Experimental results reveal that our head-wise shared models still maintain satisfactory capabilities, demonstrating the feasibility of fine-grained weight sharing applied to LLMs.
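The abstract describes the sharing mechanism only at a high level. Below is a minimal PyTorch sketch of the head-wise sharing idea for the DirectShare-style setting: per-head slices of the attention projection matrices are compared, and the most similar head pairs are tied without any retraining. The function names (`head_slices`, `direct_share`), the cosine-similarity criterion, and the greedy pairing rule are illustrative assumptions, not the paper's exact selection strategy.

```python
import torch

def head_slices(W_q, W_k, W_v, num_heads):
    """Flatten each head's rows of the Q/K/V projections (nn.Linear convention:
    weight shape is (out_features, in_features), so heads partition the rows)."""
    head_dim = W_q.shape[0] // num_heads
    return [
        torch.cat([W[h * head_dim:(h + 1) * head_dim].flatten() for W in (W_q, W_k, W_v)])
        for h in range(num_heads)
    ]

@torch.no_grad()
def direct_share(W_q, W_k, W_v, num_heads, share_ratio=0.3):
    """Tie the most similar head pairs by overwriting one head's Q/K/V rows
    with the other's (a stand-in for storing the shared weights only once)."""
    vecs = torch.stack(head_slices(W_q, W_k, W_v, num_heads))
    vecs = torch.nn.functional.normalize(vecs, dim=-1)
    sim = vecs @ vecs.T                      # pairwise cosine similarity between heads
    sim.fill_diagonal_(-1.0)                 # ignore self-similarity

    head_dim = W_q.shape[0] // num_heads
    num_pairs = int(share_ratio * num_heads)
    used = set()
    # Greedily pick the highest-similarity pairs among still-unshared heads.
    for idx in torch.argsort(sim.flatten(), descending=True):
        i, j = divmod(idx.item(), num_heads)
        if i in used or j in used:
            continue
        for W in (W_q, W_k, W_v):            # reuse head i's weights for head j
            W[j * head_dim:(j + 1) * head_dim] = W[i * head_dim:(i + 1) * head_dim]
        used.update((i, j))
        if len(used) // 2 >= num_pairs:
            break
    return W_q, W_k, W_v
```

In a real deployment the memory saving comes from storing each shared slice once (e.g. via tied parameters or views), not from the in-place copy shown here; the copy only demonstrates the head-selection step. A PostShare-style variant would instead add a similarity-encouraging regularizer during post-training before applying the same selection.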