ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers (2310.02489v2)
Abstract: The memory constraint of always-on devices is one of the major concerns when deploying speech processing models on such devices. While larger models trained on sufficiently large amounts of data generally perform better, making them fit within device memory is a demanding challenge. In this paper, we aim to reduce model size by reparameterizing model weights across Transformer encoder layers and assuming a special weight composition and structure. More specifically, inspired by ResNet and the more recent LoRA work, we propose an approach named ResidualTransformer, in which each weight matrix in a Transformer layer comprises 1) a full-rank component shared with its adjacent layers, and 2) a low-rank component unique to the layer itself. The low-rank matrices account for only a small increase in model size. In addition, we add diagonal weight matrices to improve the modeling capacity of the low-rank matrices. Experiments on our 10k-hour speech recognition and speech translation tasks show that the Transformer encoder size can be reduced by ~3x with only slight performance degradation.
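To make the weight composition above concrete, here is a minimal sketch in PyTorch of a linear projection whose full-rank weight is shared across a group of adjacent layers, while each layer adds its own low-rank plus diagonal residual, i.e. W_i = W_shared + U_i V_i + diag(d_i). This is an assumption-based illustration of the idea, not the paper's implementation; names such as `SharedResidualLinear`, `group_size`, and `rank` are hypothetical.

```python
# Illustrative sketch (not the paper's code): per-layer weight is a shared
# full-rank matrix plus a layer-specific low-rank + diagonal residual.
import torch
import torch.nn as nn


class SharedResidualLinear(nn.Module):
    """y = x (W_shared + U V + diag(d))^T + b, with W_shared shared across adjacent layers."""

    def __init__(self, shared_weight: nn.Parameter, dim: int, rank: int = 8):
        super().__init__()
        self.shared_weight = shared_weight                    # full-rank component, shared within a layer group
        self.U = nn.Parameter(torch.randn(dim, rank) * 0.01)  # per-layer low-rank factor
        self.V = nn.Parameter(torch.zeros(rank, dim))         # per-layer low-rank factor
        self.d = nn.Parameter(torch.zeros(dim))                # per-layer diagonal term
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.shared_weight + self.U @ self.V + torch.diag(self.d)
        return nn.functional.linear(x, w, self.bias)


# Example: 12 layers, one shared full-rank weight per group of 3 adjacent layers.
dim, num_layers, group_size = 256, 12, 3
shared = [nn.Parameter(torch.empty(dim, dim)) for _ in range(num_layers // group_size)]
for w in shared:
    nn.init.xavier_uniform_(w)
layers = [SharedResidualLinear(shared[i // group_size], dim) for i in range(num_layers)]
```

Only the per-layer factors U_i, V_i, and d_i (plus biases) are unique to each layer, which is why the added parameters remain small relative to the shared full-rank weights.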
- “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
- “LoRA: Low-rank adaptation of large language models,” in Proc. ICLR, 2022.
- “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019.
- “Language models are few-shot learners,” in Proc. NeurIPS, 2020.
- “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023.
- “Google USM: scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
- “EnergonAI: An inference system for 10-100 billion parameter transformer models,” arXiv preprint arXiv:2209.02341, 2022.
- Yiming Wang, Wake word detection and its applications, Ph.D. thesis, Johns Hopkins University, 2021.
- “Sharing low rank conformer weights for tiny always-on ambient speech recognition models,” in Proc. ICASSP, 2023.
- “Attention is all you need,” in Proc. NeurIPS, 2017.
- “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proc. ACL, 2019.
- “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. ICASSP, 2018.
- “Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,” in Proc. ICASSP, 2020.
- “Wake word detection with streaming transformers,” in Proc. ICASSP, 2021.
- “Distilling the knowledge in a neural network,” in Proc. NeurIPS Deep Learning Workshop, 2014.
- “Compression of end-to-end models,” in Proc. Interspeech, 2018.
- “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in Proc. ICASSP, 2014.
- “Lightweight and efficient end-to-end speech recognition using low-rank transformer,” in Proc. ICASSP, 2020.
- “Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding,” in Proc. ICLR, 2016.
- “4-bit Conformer with native quantization aware training for speech recognition,” in Proc. Interspeech, 2022.
- “Learning both weights and connections for efficient neural network,” in Proc. NeurIPS, 2015.
- “To prune, or not to prune: Exploring the efficacy of pruning for model compression,” in Proc. ICLR Workshop Track, 2018.
- “Universal transformers,” in Proc. ICLR, 2019.
- “ALBERT: A lite BERT for self-supervised learning of language representations,” in Proc. ICLR, 2020.
- “Extremely low footprint end-to-end ASR system for smart device,” in Proc. Interspeech, 2021.
- “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020.
- “LightFormer: Light-weight transformer using SVD-based weight transfer and parameter sharing,” in Proc. Findings of ACL, 2023.
- “Extended low-rank plus diagonal adaptation for deep and recurrent neural networks,” in Proc. ICASSP, 2017.
- “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. EMNLP: System Demonstrations, 2018.
- “CoVoST 2 and massively multilingual speech translation,” in Proc. Interspeech, 2021.
- “Microsoft speech language translation (MSLT) corpus: The IWSLT 2016 release for English, French and German,” in Proc. IWSLT, 2016.
- Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
- “Developing real-time streaming transformer transducer for speech recognition on large-scale dataset,” in Proc. ICASSP, 2021.