Transformer tricks: Removing weights for skipless transformers (2404.12362v1)
Abstract: He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is only applicable to MHA (multi-head attention), not to MQA (multi-query attention) or GQA (grouped-query attention). The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). See arXiv:2402.13388 and https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
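One way to see the kind of equivalence the abstract refers to rests on two weight foldings that hold in a skipless block, where the attention output feeds the next linear layer directly and is not re-added through a residual path: the post-attention projection P can be pre-multiplied into the FFN's first matrix, and the concatenated query projection Q, which stays a square (and generically invertible) d-by-d matrix even under MQA/GQA, can be removed by re-expressing the hidden state in the basis x·Q while K and V absorb its inverse (the shared K/V projections are not square, which is presumably why the original V-removal does not carry over). The NumPy sketch below checks both foldings numerically; the shapes, variable names, and single shared K/V head are illustrative assumptions, not the paper's reference code.

```python
# Minimal NumPy sketch (not the paper's reference code) of the two weight
# foldings described above. Shapes follow an MQA-style layout with a single
# shared K/V head; all names and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k, e = 8, 64, 16, 256          # tokens, model dim, head dim, FFN dim

x   = rng.standard_normal((n, d))      # skipless hidden state (no residual path)
W_Q = rng.standard_normal((d, d))      # all query heads concatenated -> square
W_K = rng.standard_normal((d, d_k))    # single shared key head (MQA)
W_V = rng.standard_normal((d, d_k))    # single shared value head (MQA)
W_P = rng.standard_normal((d, d))      # post-attention projection P
W_1 = rng.standard_normal((d, e))      # first (up-projection) FFN matrix

# (1) Remove P: without a skip connection the attention output o goes straight
#     into W_1, so (o @ W_P) @ W_1 == o @ (W_P @ W_1) and P folds into W_1.
o = rng.standard_normal((n, d))        # stand-in for the concatenated head outputs
W_1_folded = W_P @ W_1                 # same shape as W_1 -> d*d weights removed
assert np.allclose((o @ W_P) @ W_1, o @ W_1_folded)

# (2) Remove Q: W_Q is square (and generically invertible), so switch to the
#     basis x_star = x @ W_Q. Queries are then x_star itself, while K and V
#     absorb the inverse; in a skipless network x_star can be produced directly
#     by folding W_Q into the preceding layer's output matrix.
x_star     = x @ W_Q
W_K_folded = np.linalg.solve(W_Q, W_K)   # = inv(W_Q) @ W_K, same shape as W_K
W_V_folded = np.linalg.solve(W_Q, W_V)   # = inv(W_Q) @ W_V, same shape as W_V
assert np.allclose(x @ W_K, x_star @ W_K_folded)   # keys unchanged
assert np.allclose(x @ W_V, x_star @ W_V_folded)   # values unchanged
print("P and Q foldings are numerically equivalent")
```

With Q and P each d-by-d per layer, the removed weights for Mistral-7B (d = 4096, 32 layers) amount to 2 · 32 · 4096² ≈ 1.1B of its roughly 7.2B parameters, consistent with the 15% figure quoted in the abstract.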
- Bobby He and Thomas Hofmann. Simplifying Transformer Blocks. November 2023. arXiv:2311.01906.
- Ashish Vaswani et al. Attention Is All You Need. June 2017. arXiv:1706.03762.
- Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. November 2019. arXiv:1911.02150.
- Joshua Ainslie et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. May 2023. arXiv:2305.13245.
- Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. July 2023. arXiv:2307.09288.
- Albert Q. Jiang et al. Mistral 7B. October 2023. arXiv:2310.06825.
- Albert Q. Jiang et al. Mixtral of Experts. January 2024. arXiv:2401.04088.
- Aakanksha Chowdhery et al. PaLM: Scaling Language Modeling with Pathways. April 2022. arXiv:2204.02311.
- Gemma Team, Google DeepMind. Gemma: Open Models Based on Gemini Research and Technology. 2024.
- Frank Elavsky. The Micro-Paper: Towards cheaper, citable research ideas and conversations. February 2023. arXiv:2302.12854.
- OpenMachine. Transformer tricks. 2024. GitHub repository, https://github.com/OpenMachine-ai/transformer-tricks.
- Nils Graef. Transformer tricks: Precomputing the first layer. February 2024. arXiv:2402.13388.
- Bobby He et al. Deep Transformers without Shortcuts: Modifying Self-Attention for Faithful Signal Propagation. February 2023. arXiv:2302.10322. Also published at ICLR 2023.
- Wikipedia. Invertible matrix. 2024. Accessed March 2024.
- Noam Shazeer. GLU Variants Improve Transformer. February 2020. arXiv:2002.05202.
- Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. 2021. GitHub repository.