SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling (2309.12578v1)
Abstract: Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very computationally demanding. Prior efforts to sparsify the Transformer have used either a fixed pattern or a data-driven approach to reduce the number of operations in multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from unavoidable problems, such as the potential loss of essential sequence features due to a uniform fixed pattern applied across all layers, and an increase in model size resulting from the additional parameters used to learn sparsity patterns in attention operations. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training. We develop efficient GPU implementations of the layer-wise sparsified attention algorithm, demonstrating that SPION achieves up to a 3.08X speedup over existing state-of-the-art sparse Transformer models, with better evaluation quality.
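The abstract describes detecting a layer-wise sparsity pattern by combining a convolution filter with flood filling over the attention-score matrix. The sketch below is not the authors' implementation; it is a minimal illustration of that idea under stated assumptions. The function name `detect_sparse_pattern`, the 3x3 averaging filter, and the thresholds are all illustrative choices, not details taken from the paper.

```python
# Minimal sketch: derive a boolean sparsity mask for one layer's attention-score
# matrix by (1) smoothing with a small convolution filter, (2) thresholding to
# find seed entries, and (3) flood-filling outward from the seeds to capture
# connected high-score regions. All parameter values here are assumptions.
from collections import deque
import numpy as np

def detect_sparse_pattern(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask of attention entries to keep for one layer."""
    n = scores.shape[0]
    # 1) Smooth with a zero-padded 3x3 averaging filter so isolated noise
    #    does not seed the pattern.
    padded = np.pad(scores, 1, mode="constant")
    smoothed = np.zeros_like(scores)
    for i in range(n):
        for j in range(n):
            smoothed[i, j] = padded[i:i + 3, j:j + 3].mean()

    # 2) Seed the mask with entries whose smoothed score clears the threshold.
    strong = smoothed >= threshold * smoothed.max()

    # 3) Flood fill: grow each seed into its 4-connected neighborhood while the
    #    raw score stays above a relaxed threshold, capturing block-like regions.
    keep = np.zeros_like(strong)
    relaxed = 0.5 * threshold * smoothed.max()
    queue = deque(zip(*np.nonzero(strong)))
    while queue:
        i, j = queue.popleft()
        if keep[i, j]:
            continue
        keep[i, j] = True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n and not keep[ni, nj] \
                    and scores[ni, nj] >= relaxed:
                queue.append((ni, nj))
    return keep

# Usage on a toy score matrix; the resulting mask would restrict which
# query-key products are computed in the sparsified attention step.
rng = np.random.default_rng(0)
scores = rng.random((16, 16))
mask = detect_sparse_pattern(scores)
print(f"kept {mask.mean():.0%} of the attention entries")
```

In SPION the resulting pattern is computed per layer, so different layers can keep different regions of the attention map rather than sharing one uniform fixed pattern.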