LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned Proportions (2405.13046v1)
Abstract: A promising approach to preserving model performance in linearized transformers is to employ position-based re-weighting functions. However, state-of-the-art re-weighting functions rely heavily on target sequence lengths, making them difficult or impossible to apply to autoregressive and simultaneous tasks, where the target sequence length, and sometimes even the input sequence length, is unknown. To address this issue, we propose Learned Proportions (LeaP) and LeaPformers. Our contribution is built on two major components. First, we generalize the dependence on explicit positional representations and sequence lengths into a dependence on sequence proportions for re-weighting. Second, we replace static positional representations with dynamic proportions derived via a compact module, enabling more flexible attention concentration patterns. We evaluate LeaPformer against eight representative efficient transformers on the Long-Range Arena benchmark, where it achieves the best quality-throughput trade-off, and also apply LeaPformer to Wikitext-103 autoregressive language modeling and simultaneous speech-to-text translation for two language pairs, achieving competitive results.
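The two components described in the abstract, proportion-based re-weighting and dynamic proportions produced by a compact module, can be illustrated with a short sketch. Below is a minimal, non-causal PyTorch sketch assuming a cosFormer-style cosine re-weighting over a ReLU feature map; the `LeaPStyleLinearAttention` class, its single-layer `prop` module, and all dimensions are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch (not the authors' implementation) of linear attention whose
# cosine re-weighting depends on learned per-token proportions instead of
# explicit positions divided by a known sequence length.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class LeaPStyleLinearAttention(nn.Module):
    """Non-causal, cosFormer-style linear attention with learned proportions."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # Hypothetical "compact module": predicts a proportion in [0, 1] per token.
        self.prop = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q = F.relu(self.q_proj(x)).view(b, n, self.h, self.d)
        k = F.relu(self.k_proj(x)).view(b, n, self.h, self.d)
        v = self.v_proj(x).view(b, n, self.h, self.d)

        # Learned proportions stand in for the static i/N of length-dependent
        # re-weighting, so no target length is required.
        p = self.prop(x).squeeze(-1)              # (b, n), values in [0, 1]
        angle = (math.pi / 2) * p                 # cos/sin stay non-negative
        cos = torch.cos(angle)[..., None, None]   # broadcast over heads/dims
        sin = torch.sin(angle)[..., None, None]

        def accumulate(q_w, k_w):
            # Separable form: O(n) in sequence length.
            kv = torch.einsum("bnhd,bnhe->bhde", k_w, v)  # sum_j phi(k_j) v_j^T
            z = k_w.sum(dim=1)                            # sum_j phi(k_j)
            num = torch.einsum("bnhd,bhde->bnhe", q_w, kv)
            den = torch.einsum("bnhd,bhd->bnh", q_w, z)
            return num, den

        # cos(pi/2 * (p_i - p_j)) = cos(a_i)cos(a_j) + sin(a_i)sin(a_j)
        num_c, den_c = accumulate(q * cos, k * cos)
        num_s, den_s = accumulate(q * sin, k * sin)
        out = (num_c + num_s) / (den_c + den_s).clamp(min=1e-6)[..., None]
        return self.o_proj(out.reshape(b, n, -1))
```

Because cos((π/2)(p_i − p_j)) splits into cos·cos plus sin·sin terms, the re-weighted attention still factors into separable query-side and key-side sums, keeping compute linear in sequence length while depending only on learned proportions rather than known lengths. In an autoregressive or simultaneous setting, the sums over keys would instead be causal prefix sums (cumulative sums over j ≤ i).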