Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks (2012.02030v3)
Abstract: Attention mechanisms play a crucial role in the neural revolution of NLP. With the growth of attention-based models, several pruning techniques have been developed to identify and exploit sparseness, making these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, guiding future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding in the development of improved models for existing or new NLP applications. We demonstrate this with encoder and autoregressive transformer models using Triton GPU kernels and make our code publicly available at https://github.com/irugina/AP.
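Below is a minimal sketch of the idea the abstract describes: average attention maps observed on a fixed dataset, threshold them into a global binary mask, and reuse that mask to skip low-importance attention positions. It is an illustrative assumption of how such a pipeline could look, not the authors' actual implementation or API; function names such as `collect_average_attention` and `build_global_mask` are hypothetical, and the real code uses Triton kernels for the sparse computation.

```python
# Sketch of data-informed global attention sparseness (assumed pipeline, not the paper's code).
import torch
import torch.nn.functional as F


def collect_average_attention(attn_maps):
    """Average per-example attention maps, each of shape [heads, seq, seq]."""
    return torch.stack(attn_maps, dim=0).mean(dim=0)


def build_global_mask(avg_attn, keep_fraction=0.1):
    """Keep only the top `keep_fraction` of positions per head (global top-k)."""
    heads, seq, _ = avg_attn.shape
    k = max(1, int(keep_fraction * seq * seq))
    mask = torch.zeros_like(avg_attn, dtype=torch.bool)
    for h in range(heads):
        topk = avg_attn[h].flatten().topk(k).indices
        mask[h].view(-1)[topk] = True
        # Always keep the diagonal so every query attends to at least itself.
        mask[h].fill_diagonal_(True)
    return mask  # [heads, seq, seq] boolean sparseness mask


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with the precomputed global mask applied."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # [heads, seq, seq]
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    heads, seq, dim = 4, 16, 8
    # Pretend these are attention maps observed on a fixed calibration dataset.
    observed = [torch.rand(heads, seq, seq).softmax(dim=-1) for _ in range(32)]
    mask = build_global_mask(collect_average_attention(observed), keep_fraction=0.5)
    q, k, v = (torch.randn(heads, seq, dim) for _ in range(3))
    print(masked_attention(q, k, v, mask).shape)  # torch.Size([4, 16, 8])
```

In this dense-tensor sketch the mask only zeroes out attention weights; the savings the abstract reports come from a sparse kernel (Triton in the paper) that skips the masked blocks entirely rather than computing and discarding them.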