How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena (2402.13208v1)
Abstract: The attention mechanism, a cornerstone of state-of-the-art neural models, faces computational hurdles in processing long sequences due to its quadratic complexity. Consequently, research efforts in the last few years have focused on finding more efficient alternatives. Among them, Hyena (Poli et al., 2023) stands out for achieving competitive results in both language modeling and image classification, while offering sub-quadratic memory and computational complexity. Building on these promising results, we propose ConfHyena, a Conformer whose encoder self-attentions are replaced with an adaptation of Hyena for speech processing, where long input sequences cause high computational costs. Through experiments in automatic speech recognition (for English) and translation (from English into 8 target languages), we show that our best ConfHyena model significantly reduces training time by 27%, at the cost of minimal quality degradation (~1%), which, in most cases, is not statistically significant.
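To make the architectural idea concrete, the minimal sketch below (an illustrative approximation, not the authors' implementation) shows how a Hyena-style operator, i.e., a gated long convolution evaluated with FFTs in O(T log T) time, could stand in for the multi-head self-attention sub-layer of a Conformer encoder block. The class name `HyenaOperator`, the explicit per-channel filter parameterization, and the `max_len` bound are assumptions made for brevity; the original Hyena parameterizes its long filters implicitly with a small network and adds short depthwise convolutions to the input projections.

```python
# Illustrative sketch (not the ConfHyena code): an order-2 Hyena-style mixer,
# a gated long convolution computed via FFTs, as a stand-in for the
# self-attention sub-layer of a Conformer encoder block.
import torch
import torch.nn as nn


class HyenaOperator(nn.Module):
    def __init__(self, d_model: int, max_len: int = 4096):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 3 * d_model)   # value + two gates
        self.out_proj = nn.Linear(d_model, d_model)
        # Explicit learned long filter, one per channel. The original Hyena
        # instead generates this filter implicitly from positional features.
        self.filter = nn.Parameter(0.02 * torch.randn(d_model, max_len))

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, time, channels); time must not exceed max_len here.
        B, T, D = u.shape
        v, g1, g2 = self.in_proj(u).chunk(3, dim=-1)
        x = (g1 * v).transpose(1, 2)                      # (B, D, T), first gate
        h = self.filter[:, :T]                            # (D, T) long filter
        # Causal linear convolution via zero-padded FFTs: O(T log T).
        X = torch.fft.rfft(x, n=2 * T)
        H = torch.fft.rfft(h, n=2 * T)
        y = torch.fft.irfft(X * H, n=2 * T)[..., :T].transpose(1, 2)
        return self.out_proj(g2 * y)                      # second gate + output


# Usage: drop-in replacement for self-attention on a (B, T, D) speech encoding.
mixer = HyenaOperator(d_model=256)
out = mixer(torch.randn(2, 1500, 256))                    # -> (2, 1500, 256)
```

Since the operator above is causal while a speech encoder can attend to the full utterance, the exact adaptation used in ConfHyena may differ; the sketch is meant only as one plausible instantiation of the operator family.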
- Efficient transformer for direct speech translation.
- End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945–4949.
- End-to-End Automatic Speech Translation of Audiobooks. In Proceedings of ICASSP 2018 - IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Alberta, Canada.
- Ron Bracewell and Peter B. Kahn. 1966. The Fourier Transform and Its Applications. American Journal of Physics, 34(8):712–712.
- MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.
- Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964.
- Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, volume 28.
- Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems.
- Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 933–941. JMLR.org.
- Paul de Laat. 2021. Companies committed to responsible AI: From principles towards implementation and regulation? Philosophy & Technology, 34.
- Enhancing Transformer for End-to-end Speech-to-Text Translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 21–31, Dublin, Ireland. European Association for Machine Translation.
- Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, Online. Association for Computational Linguistics.
- CTC-based compression for direct speech translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 690–696, Online.
- Convolutional networks. In Deep Learning, chapter 9, pages 330–372. MIT Press, Cambridge, MA, USA.
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 369–376, Pittsburgh, Pennsylvania.
- Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pages 5036–5040.
- Non-autoregressive end-to-end speech translation with parallel autoregressive rescoring. arXiv preprint arXiv:2109.04411.
- Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 448–456. JMLR.org.
- Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.
- Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium.
- Transformers in speech processing: A survey. arXiv preprint arXiv:2303.11607.
- FNet: Mixing tokens with Fourier transforms. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4296–4313, Seattle, United States. Association for Computational Linguistics.
- Unraveling the hidden environmental impacts of AI solutions for environment. arXiv preprint arXiv:2110.11822.
- A survey of transformers. AI Open.
- Bridging the modality gap for speech-to-text translation.
- Understanding and improving transformer from a multi-particle dynamic system point of view.
- Alan V. Oppenheim and Ronald W. Schafer. 1975. Digital Signal Processing. Prentice Hall international editions. Prentice-Hall.
- Alan V. Oppenheim and Ronald W. Schafer. 2009. Discrete-Time Signal Processing, 3rd edition. Prentice Hall Press, USA.
- Speechformer: Reducing information loss in direct speech translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1698–1706, Online and Punta Cana, Dominican Republic.
- When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP.
- Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation.
- Reproducing Whisper-style training using an open-source toolkit and publicly available data.
- Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning.
- Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium.
- Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
- Searching for activation functions.
- Tackling climate change with machine learning. ACM Computing Surveys (CSUR), 55(2):1–96.
- Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1835–1841, Florence, Italy. Association for Computational Linguistics.
- Ivan W Selesnick and C Sidney Burrus. 1997. Fast Convolution and Filtering. CRC Press.
- Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.
- Efficient transformers: A survey. ACM Comput. Surv., 55(6).
- Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Encoding word order in complex embeddings. In International Conference on Learning Representations.
- Fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of the 2020 Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL): System Demonstrations.
- Aimee van Wynsberghe. 2021. Sustainable AI: AI for sustainability and the sustainability of AI. AI and Ethics, 1.
- Adaptive feature selection for end-to-end speech translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2533–2544, Online. Association for Computational Linguistics.
- Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037.
- RedApt: An adaptor for wav2vec 2 encoding: Faster and smaller speech translation without quality compromise. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1960–1967, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Authors: Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli