Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations (2407.03495v1)

Published 3 Jul 2024 in eess.AS, cs.CL, and cs.LG

Abstract: Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundation models. In this work, we present a comprehensive analysis of building ASR systems with discrete codes. We investigate different methods for codec training, such as quantization schemes and time-domain versus spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at a similar bit rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143-language ML-SUPERB benchmark, despite being smaller and pretrained on significantly less data.
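
The abstract contrasts quantization schemes used when training the codec. As a rough illustration only (not the paper's implementation), the sketch below shows residual vector quantization (RVQ), the scheme behind codecs such as SoundStream and Encodec: each codebook quantizes what the previous one left over, so every frame becomes a small stack of integer codes an ASR model can consume as tokens. All sizes here (num_codebooks, codebook_size, dim, the 75 frames/s rate) are illustrative assumptions, and the codebooks are random rather than learned.

```python
# Minimal RVQ sketch, assuming random (untrained) codebooks.
# A real neural codec learns the codebooks jointly with its encoder/decoder.
import numpy as np

rng = np.random.default_rng(0)

num_codebooks = 4      # hypothetical: codebooks per frame (controls bit rate)
codebook_size = 256    # hypothetical: entries per codebook (8 bits each)
dim = 128              # hypothetical: frame-embedding dimension

codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))

def rvq_encode(frames: np.ndarray) -> np.ndarray:
    """Quantize (T, dim) frame embeddings into (T, num_codebooks) integer codes."""
    residual = frames.copy()
    codes = np.empty((frames.shape[0], num_codebooks), dtype=np.int64)
    for k in range(num_codebooks):
        # Nearest codeword to the current residual, per frame.
        dists = ((residual[:, None, :] - codebooks[k][None, :, :]) ** 2).sum(-1)
        codes[:, k] = dists.argmin(axis=1)
        # Subtract the chosen codeword; the next codebook refines the remainder.
        residual -= codebooks[k][codes[:, k]]
    return codes

def rvq_decode(codes: np.ndarray) -> np.ndarray:
    """Reconstruct (T, dim) embeddings by summing the selected codewords."""
    return sum(codebooks[k][codes[:, k]] for k in range(num_codebooks))

frames = rng.normal(size=(75, dim))   # e.g. one second of audio at 75 frames/s
codes = rvq_encode(frames)            # (75, 4) discrete tokens for an ASR model
recon = rvq_decode(codes)
print(codes.shape, np.mean((frames - recon) ** 2))
```

With these illustrative numbers, the bit rate is num_codebooks x log2(codebook_size) x frame rate = 4 x 8 x 75 = 2.4 kbps, which is how codec depth and codebook size trade off against the "similar bit rate" comparison the abstract draws against Encodec. The paper also considers finite scalar quantization (FSQ) as an alternative scheme; the RVQ sketch above is only one of the options it studies.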
