To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency (2304.02721v3)
Published 5 Apr 2023 in cs.CL and cs.AI
Abstract: Sequence-to-sequence language models can be used to produce abstractive summaries that are coherent, relevant, and concise. Still, model size can make deployment in latency-sensitive or web-scale settings difficult. This paper studies the relationship between model size, structured pruning, inference efficiency, and summarization accuracy on widely used summarization datasets. We show that model accuracy is tied to the size of the encoder, while inference efficiency is tied to the decoder. Pruning the two asymmetrically can yield a nearly 3x improvement in inference latency with a loss of roughly one ROUGE-2 point. Moreover, we find both the average degradation and the role of asymmetry to be consistent across model sizes and dataset variations.
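The core idea, asymmetric structured pruning, keeps the encoder at full depth while shrinking the decoder, since decoder layers run once per generated token and therefore dominate autoregressive generation latency, whereas the encoder runs once per input document. Below is a minimal illustrative sketch of this kind of depth pruning on a Hugging Face BART-style summarizer; the checkpoint name, the choice of retained decoder layers, and the fine-tuning note are assumptions for illustration, not the authors' exact recipe.

```python
# Minimal sketch of asymmetric depth pruning (assumed setup, not the paper's
# exact procedure): keep the full encoder, drop most decoder layers.
import torch.nn as nn
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# bart-large has 12 encoder and 12 decoder layers; retain only a handful of
# decoder layers (indices chosen here purely for illustration).
kept_decoder_layers = [0, 3, 6, 9, 11]
model.model.decoder.layers = nn.ModuleList(
    [model.model.decoder.layers[i] for i in kept_decoder_layers]
)
model.config.decoder_layers = len(kept_decoder_layers)

# The pruned model would then be fine-tuned on the target summarization
# dataset (e.g., CNN/DailyMail or XSum) to recover most of the ROUGE-2 drop.
```

The design intuition mirrors the paper's finding: shrinking the decoder cuts per-token generation cost and thus latency, while leaving the encoder large preserves most of the summarization accuracy.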
Authors: Daniel Campos, ChengXiang Zhai