
How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena (2402.13208v1)

Published 20 Feb 2024 in cs.CL and cs.AI

Abstract: The attention mechanism, a cornerstone of state-of-the-art neural models, faces computational hurdles in processing long sequences due to its quadratic complexity. Consequently, research efforts in the last few years focused on finding more efficient alternatives. Among them, Hyena (Poli et al., 2023) stands out for achieving competitive results in both language modeling and image classification, while offering sub-quadratic memory and computational complexity. Building on these promising results, we propose ConfHyena, a Conformer whose encoder self-attentions are replaced with an adaptation of Hyena for speech processing, where the long input sequences cause high computational costs. Through experiments in automatic speech recognition (for English) and translation (from English into 8 target languages), we show that our best ConfHyena model significantly reduces the training time by 27%, at the cost of minimal quality degradation (~1%), which, in most cases, is not statistically significant.

References (49)
  1. Efficient transformer for direct speech translation.
  2. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945–4949.
  3. End-to-End Automatic Speech Translation of Audiobooks. In Proceedings of ICASSP 2018 - IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Alberta, Canada.
  4. Ron Bracewell and Peter B. Kahn. 1966. The Fourier Transform and Its Applications. American Journal of Physics, 34(8):712–712.
  5. Must-c: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.
  6. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964.
  7. Attention-based models for speech recognition. Advances in neural information processing systems, 28.
  8. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy.
  9. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems.
  10. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 933–941. JMLR.org.
  11. Paul de Laat. 2021. Companies committed to responsible AI: From principles towards implementation and regulation? Philosophy & Technology, 34.
  12. Enhancing Transformer for End-to-end Speech-to-Text Translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 21–31, Dublin, Ireland. European Association for Machine Translation.
  13. Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, Online. Association for Computational Linguistics.
  14. CTC-based compression for direct speech translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 690–696, Online.
  15. Convolutional networks. In Deep learning, chapter 9, page 330–372. MIT Press, Cambridge, MA, USA.
  16. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd international conference on Machine learning (ICML), pages 369–376, Pittsburgh, Pennsylvania.
  17. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pages 5036–5040.
  18. Non-autoregressive end-to-end speech translation with parallel autoregressive rescoring. arXiv preprint arXiv:2109.04411.
  19. Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org.
  20. Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.
  21. Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium.
  22. Transformers in speech processing: A survey. arXiv preprint arXiv:2303.11607.
  23. FNet: Mixing tokens with Fourier transforms. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4296–4313, Seattle, United States. Association for Computational Linguistics.
  24. Unraveling the hidden environmental impacts of AI solutions for environment. arXiv preprint arXiv:2110.11822.
  25. A survey of transformers. AI Open.
  26. Bridging the modality gap for speech-to-text translation.
  27. Understanding and improving transformer from a multi-particle dynamic system point of view.
  28. Alan V. Oppenheim and Ronald W. Schafer. 1975. Digital Signal Processing. Prentice Hall international editions. Prentice-Hall.
  29. Alan V. Oppenheim and Ronald W. Schafer. 2009. Discrete-Time Signal Processing, 3rd edition. Prentice Hall Press, USA.
  30. Speechformer: Reducing information loss in direct speech translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1698–1706, Online and Punta Cana, Dominican Republic.
  31. When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP.
  32. Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation.
  33. Reproducing whisper-style training using an open-source toolkit and publicly available data.
  34. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning.
  35. Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels.
  36. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
  37. Searching for activation functions.
  38. Tackling climate change with machine learning. ACM Computing Surveys (CSUR), 55(2):1–96.
  39. Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1835–1841, Florence, Italy. Association for Computational Linguistics.
  40. Ivan W Selesnick and C Sidney Burrus. 1997. Fast Convolution and Filtering. CRC Press.
  41. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.
  42. Efficient transformers: A survey. ACM Comput. Surv., 55(6).
  43. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  44. Encoding word order in complex embeddings. In International Conference on Learning Representations.
  45. fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of the 2020 Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL): System Demonstrations.
  46. Aimee van Wynsberghe. 2021. Sustainable AI: AI for sustainability and the sustainability of AI. AI and Ethics, 1.
  47. Adaptive feature selection for end-to-end speech translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2533–2544, Online. Association for Computational Linguistics.
  48. Google usm: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037.
  49. RedApt: An adaptor for wav2vec 2 encoding: faster and smaller speech translation without quality compromise. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1960–1967, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Authors (4)
  1. Marco Gaido (47 papers)
  2. Sara Papi (33 papers)
  3. Matteo Negri (93 papers)
  4. Luisa Bentivogli (38 papers)
Citations (1)

Summary

Exploring Efficient Speech Processing with ConfHyena

Introduction to ConfHyena

Recent advancements in speech processing have leaned heavily on attention-based models, such as the Conformer, which have demonstrated significant success in automatic speech recognition (ASR) and speech translation (ST). However, these models grapple with high computational costs, primarily due to the quadratic complexity of the attention mechanism, which becomes particularly pronounced in tasks involving long input sequences. In light of these challenges, the authors propose ConfHyena, a Conformer whose encoder self-attentions are replaced with an adaptation of the Hyena operator (Poli et al., 2023), specifically engineered to handle speech processing tasks efficiently.

Background and Theoretical Foundations

Self-Attention and Its Limitations

At the heart of many state-of-the-art neural architectures lies the self-attention mechanism, noted for its ability to capture dependencies in input sequences. Despite its effectiveness, the quadratic computational and memory requirements of self-attention limit its applicability in scenarios that involve long sequences, such as those commonly found in speech processing tasks.
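To make the scaling concrete, the minimal sketch below (not the authors' code, and stripped of the usual query/key/value projections) computes scaled dot-product self-attention with an explicit score matrix: for a sequence of n frames, that matrix has n × n entries, so a 30-second utterance at 100 frames per second already implies a 3000 × 3000 matrix per head and per batch element.

```python
# Minimal illustration of why self-attention is quadratic in sequence length.
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-attention; x has shape (batch, seq_len, dim)."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5   # (batch, n, n): quadratic in n
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

# A 30 s utterance at 100 frames/s gives n = 3000, so the score matrix alone
# holds 3000 * 3000 = 9,000,000 entries per batch element.
x = torch.randn(1, 3000, 256)
print(self_attention(x).shape)  # torch.Size([1, 3000, 256])
```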

The Hyena Operator

In response to these limitations, the Hyena operator was developed, offering a compelling alternative to traditional attention mechanisms by maintaining competitive performance levels while significantly reducing computational complexity. The Hyena operator employs a combination of implicitly parametrized long convolutions and data-controlled gating to achieve sub-quadratic complexity, representing a potential breakthrough for efficient speech processing.
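The simplified, single-order sketch below illustrates these two ingredients: a filter that is parametrized implicitly by a small MLP over positions (rather than stored as explicit per-position weights), applied as a long causal convolution via FFT in O(n log n) time, followed by an elementwise, data-controlled gate. The class and layer names are illustrative assumptions, not the operator released by Poli et al.; the actual Hyena operator stacks several such recurrences with richer positional features.

```python
# Simplified, single-order sketch of a Hyena-style token mixer (assumed names).
import torch
import torch.nn as nn

class HyenaLikeMixer(nn.Module):
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # data-controlled gate
        self.v_proj = nn.Linear(dim, dim)   # values to be convolved
        # Implicit parametrization: a tiny MLP maps positions to per-channel
        # filter taps instead of storing max_len weights per channel directly.
        self.register_buffer("pos", torch.linspace(0, 1, max_len).unsqueeze(-1))
        self.filter_mlp = nn.Sequential(nn.Linear(1, 64), nn.GELU(),
                                        nn.Linear(64, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, v = self.q_proj(x), self.v_proj(x)
        h = self.filter_mlp(self.pos[:n])               # (n, d) implicit filter
        # FFT-based causal convolution: O(n log n) instead of O(n^2).
        H = torch.fft.rfft(h.t(), n=2 * n)              # (d, n + 1)
        V = torch.fft.rfft(v.transpose(1, 2), n=2 * n)  # (b, d, n + 1)
        y = torch.fft.irfft(V * H, n=2 * n)[..., :n]    # (b, d, n)
        return q * y.transpose(1, 2)                    # elementwise gating

x = torch.randn(2, 1500, 256)
print(HyenaLikeMixer(256, max_len=4000)(x).shape)  # torch.Size([2, 1500, 256])
```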

The ConfHyena Model

Building on the Hyena operator, ConfHyena integrates this mechanism into the Conformer encoder to address the computational inefficiencies caused by long input sequences in speech-related tasks. The research introduces two variants: the standard ConfHyena, whose encoder self-attentions are all replaced, and the Hybrid ConfHyena, which uses Hyena operators only in the initial encoder layers, where sequences are longest, and retains self-attention in the subsequent layers, which operate on sequences already shortened by a CTC-compression module that removes redundant intermediate encodings. A schematic of this hybrid layout is sketched below.
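The following schematic relies on simplifying assumptions: placeholder feed-forward blocks stand in for full Hyena-based Conformer layers, the compression step handles a batch of one and merges frames by greedy CTC label only, and the CTC vocabulary size is arbitrary. It is meant solely to show where the Hyena-style layers, the CTC-based compression, and the remaining self-attention layers sit relative to each other, not to reproduce the authors' implementation.

```python
# Schematic sketch of a Hybrid-ConfHyena-style encoder layout (hypothetical names).
import torch
import torch.nn as nn

def compress_consecutive(x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average runs of frames sharing the same greedy CTC label (batch of 1 for brevity)."""
    segments, start = [], 0
    for t in range(1, labels.size(1) + 1):
        if t == labels.size(1) or labels[0, t] != labels[0, t - 1]:
            segments.append(x[:, start:t].mean(dim=1, keepdim=True))
            start = t
    return torch.cat(segments, dim=1)

class HybridEncoder(nn.Module):
    def __init__(self, dim: int = 256, n_early: int = 8, n_late: int = 4,
                 n_heads: int = 4, ctc_vocab: int = 1000):
        super().__init__()
        # Early layers process the full-length frame sequence, so this is where
        # the sub-quadratic Hyena-style mixing goes (placeholder blocks here).
        self.early = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(n_early)])
        self.ctc_head = nn.Linear(dim, ctc_vocab)  # vocabulary size is an assumption
        # Late layers see the shorter, compressed sequence, where standard
        # self-attention is affordable again.
        self.late = nn.ModuleList([
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_late)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.early:
            x = x + block(x)
        labels = self.ctc_head(x).argmax(dim=-1)   # (batch, n) greedy CTC labels
        x = compress_consecutive(x, labels)        # shrink before attention
        for attn in self.late:
            x = x + attn(x, x, x, need_weights=False)[0]
        return x

enc = HybridEncoder()
print(enc(torch.randn(1, 1500, 256)).shape)  # (1, compressed_length, 256)
```

The design intuition mirrors the text above: the expensive quadratic layers are only applied after the sequence has been shortened, while the layers that face the raw, long frame sequence use the cheaper Hyena-style mixing.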

Empirical Evaluations

Performance Metrics

The paper evaluates the performance of ConfHyena models across several benchmarks, focusing on English ASR and translation tasks into eight different languages. The results reveal that the Hybrid ConfHyena model achieves a notable reduction in training time by 27%, with only a minimal and often statistically insignificant degradation in quality compared to the baseline Conformer model.

Training and Inference Efficiency

An integral part of the paper's contribution lies in its thorough analysis of model efficiency. Notably, Hybrid ConfHyena significantly outperforms the baseline in terms of both training and inference efficiency, offering a much-needed solution to the high computational demands of state-of-the-art speech processing models without substantially compromising output quality.

Future Directions and Implications

The paper opens up several avenues for future research, particularly in exploring the potential of reduced downsampling strategies and their impact on performance and efficiency. Additionally, the implications of adopting models like ConfHyena extend beyond mere technical efficiency; they resonate with broader considerations around environmental sustainability, cost-effectiveness, and the democratization of AI technologies.

Conclusion

In summary, ConfHyena represents a significant step forward in the pursuit of more efficient speech processing models. By integrating the Hyena operator into the encoder of Conformer architectures, the model achieves substantial reductions in computational costs while maintaining competitive performance levels. As AI continues to evolve, such innovations underscore the importance of balancing efficiency with efficacy, ensuring that advanced capabilities remain accessible and sustainable.
