Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations (2407.03495v1)
Abstract: Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis of building ASR systems with discrete codes. We investigate different methods for codec training, such as quantization schemes and time-domain vs. spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at a similar bitrate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143-language ML-SUPERB benchmark, despite being smaller and pretrained on significantly less data.
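The quantization schemes the abstract alludes to include residual vector quantization (RVQ, used by SoundStream and Encodec) and finite scalar quantization (FSQ, Mentzer et al., cited below). As a rough illustration of how the two differ at inference time, here is a minimal NumPy sketch; the function names, latent dimensions, and random codebooks are illustrative assumptions rather than the paper's implementation, and training-time details (straight-through gradients, codebook learning) are omitted.

```python
import numpy as np

def fsq_quantize(z, levels=7):
    """Finite scalar quantization, simplified: bound each latent
    dimension with tanh, then round to one of `levels` uniformly
    spaced values. The codebook is implicit -- it is just the grid
    of per-dimension levels (an odd count keeps the grid symmetric)."""
    half = levels // 2
    bounded = np.tanh(z) * half        # squash into (-half, half)
    return np.round(bounded) / half    # snap to grid, rescale to [-1, 1]

def rvq_quantize(z, codebooks):
    """Residual vector quantization, simplified: each stage quantizes
    the residual left by the previous stages. `codebooks` is a list of
    (K, d) arrays; returns the reconstruction and one index per stage."""
    quantized, indices = np.zeros_like(z), []
    residual = z.copy()
    for cb in codebooks:
        # nearest codeword to the current residual (L2 distance)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        quantized += cb[idx]
        residual = z - quantized       # what later stages must explain
    return quantized, indices

rng = np.random.default_rng(0)
z = rng.standard_normal(4)                                    # toy latent frame
codebooks = [rng.standard_normal((16, 4)) for _ in range(3)]  # 3 RVQ stages

print("FSQ codes:", fsq_quantize(z))
print("RVQ tokens:", rvq_quantize(z, codebooks)[1])
```

The key design difference this sketch exposes: RVQ stores explicit learned codebooks and emits one token per stage, while FSQ's codebook is implicit in the rounding grid, which sidesteps codebook-collapse issues during training.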
- G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
- W. Chan et al., “SpeechStew: Simply mix all available speech recognition data to train one large neural network,” arXiv preprint arXiv:2104.02133, 2021.
- T. J. Park et al., “A review of speaker diarization: Recent advances with deep learning,” Computer Speech & Language, vol. 72, 2022.
- Q. Zhang et al., “Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020.
- A. Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020.
- C. Wang et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
- X. Wang et al., “SpeechX: Neural codec language model as a versatile speech transformer,” arXiv preprint arXiv:2308.06873, 2023.
- T. N. Sainath et al., “Multichannel signal processing with deep neural networks for automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 965–979, 2017.
- Y. Luo and N. Mesgarani, “TasNet: Time-domain audio separation network for real-time, single-channel speech separation,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2018, pp. 696–700.
- M. Won et al., “Data-driven harmonic filters for audio representation learning,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020.
- N. Zeghidour et al., “LEAF: A learnable frontend for audio classification,” in Proc. Int. Conf. Learning Representations (ICLR), 2021.
- G. Synnaeve et al., “End-to-end ASR: from supervised to semi-supervised learning with modern architectures,” in Proc. ICML Workshop on Self-supervision in Audio and Speech, 2020.
- R. Prabhavalkar et al., “End-to-end speech recognition: A survey,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
- K. C. Puvvada et al., “Discrete audio representation as an alternative to mel-spectrograms for speaker and speech recognition,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2024.
- X. Chang et al., “Exploration of efficient end-to-end ASR using discretized input from self-supervised learning,” arXiv preprint arXiv:2305.18108, 2023.
- X. Chang et al., “Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study,” arXiv preprint arXiv:2309.15800, 2023.
- W.-N. Hsu et al., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- Y.-A. Chung et al., “w2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 244–250.
- S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- Z. Huang, C. Meng, and T. Ko, “RepCodec: A speech representation codec for speech tokenization,” arXiv preprint arXiv:2309.00169, 2023.
- N. Zeghidour et al., “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
- A. Défossez et al., “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.
- R. Kumar et al., “High-fidelity audio compression with improved RVQGAN,” in Proc. Conf. on Neural Information Process. Systems (NeurIPS), 2023.
- Y.-C. Wu et al., “AudioDec: An open-source streaming high-fidelity neural audio codec,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2023.
- X. Zhang et al., “SpeechTokenizer: Unified speech tokenizer for speech large language models,” arXiv preprint arXiv:2308.16692, 2023.
- Z. Borsos et al., “SoundStorm: Efficient parallel audio generation,” arXiv preprint arXiv:2305.09636, 2023.
- J. Shi et al., “ML-SUPERB: Multilingual speech universal performance benchmark,” in Proc. Interspeech, 2023, pp. 884–888.
- F. Mentzer et al., “Finite scalar quantization: VQ-VAE made simple,” in Proc. Int. Conf. Learning Representations (ICLR), 2024.
- R. Langman et al., “Spectral Codecs: Spectrogram-based audio codecs for high quality speech synthesis,” arXiv preprint arXiv:2406.05298, 2024.
- J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. Conf. on Neural Information Process. Systems (NeurIPS), 2020.
- D. S. Park et al., “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019.
- N. Jain et al., “NEFTune: Noisy embeddings improve instruction finetuning,” in Proc. Int. Conf. Learning Representations (ICLR), 2024.
- J. Kahn et al., “Libri-Light: A benchmark for ASR with limited or no supervision,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020.
- D. Rekesh et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
- V. Panayotov et al., “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2015, pp. 5206–5210.
- T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. Conf. on Empirical Methods in Natural Language Processing: System Demonstrations, 2018.
- X. Chang et al., “Interspeech 2024 speech processing using discrete speech unit challenge,” https://www.wavlab.org/activities/2024/Interspeech2024-Discrete-Speech-Unit-Challenge, [Online].
- NVIDIA, “NeMo: a toolkit for conversational AI,” https://github.com/NVIDIA/NeMo, [Online; accessed May, 2024].