
Augmenting conformers with structured state-space sequence models for online speech recognition

Published 15 Sep 2023 in cs.CL, cs.SD, and eess.AS (arXiv:2309.08551v2)

Abstract: Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. We perform systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We find that the most effective design stacks a small S4 with real-valued recurrent weights on top of a local convolution, allowing the two to work complementarily. Our best model achieves WERs of 4.01%/8.53% on the Librispeech test sets, outperforming Conformers with extensively tuned convolution.
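The core design described in the abstract can be illustrated with a minimal sketch. The PyTorch block below is a hypothetical illustration, not the authors' implementation: a per-channel diagonal state-space layer with real-valued recurrent weights supplies arbitrarily long left context, and is stacked with a causal depthwise convolution that supplies local context. A real S4 layer would use a structured initialization and a parallel scan or convolution-kernel computation rather than the explicit time loop shown here; all class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class RealDiagonalSSM(nn.Module):
    """Illustrative diagonal state-space layer with real-valued recurrence.

    Each channel keeps a scalar state h_t = a * h_{t-1} + b * x_t and emits
    y_t = c * h_t + d * x_t, so the layer only ever sees left context
    (suitable for online ASR).
    """

    def __init__(self, d_model: int):
        super().__init__()
        # Parameterize a in (0, 1) via a sigmoid to keep the recurrence stable.
        self.a_logit = nn.Parameter(torch.randn(d_model))
        self.b = nn.Parameter(torch.ones(d_model))
        self.c = nn.Parameter(torch.ones(d_model))
        self.d = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        a = torch.sigmoid(self.a_logit)
        h = torch.zeros(x.size(0), x.size(2), device=x.device, dtype=x.dtype)
        ys = []
        for t in range(x.size(1)):  # sequential scan; real S4 parallelizes this
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h + self.d * x[:, t])
        return torch.stack(ys, dim=1)


class S4ConvBlock(nn.Module):
    """Stacks the SSM (long left context) with a causal depthwise conv (local context)."""

    def __init__(self, d_model: int, kernel_size: int = 15):
        super().__init__()
        self.ssm = RealDiagonalSSM(d_model)
        # Left-pad by kernel_size - 1 and trim the tail to keep the conv causal.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              groups=d_model, padding=kernel_size - 1)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.ssm(self.norm(x))
        y = self.conv(y.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + y  # residual connection
```

As a usage sketch, such a block could sit inside a Conformer-style encoder in place of, or alongside, its convolution module; per the abstract, the best configuration keeps the S4 component small and pairs it with a local convolution rather than replacing the convolution entirely.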
