On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition (2310.08132v1)
Abstract: Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain-mismatch tasks. It has been shown that TTS-generated outputs still do not match the quality of real data. In this work, we focus on the temporal structure of synthetic data and its relation to ASR training. Using a novel oracle setup, we show how much the quality degradation of synthetic data is influenced by duration modeling in non-autoregressive (NAR) TTS. To obtain reference phoneme durations, we use two common alignment methods: a hidden Markov model Gaussian mixture model (HMM-GMM) aligner and a neural connectionist temporal classification (CTC) aligner. Using a simple algorithm based on random walks, we shift the phoneme duration distributions of the TTS system closer to real durations, which improves an ASR system trained on synthetic data in a semi-supervised setting.
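The abstract's core idea of nudging predicted phoneme durations toward real-data statistics via a random walk can be sketched as follows. This is a hypothetical minimal illustration, not the paper's actual algorithm: it assumes the target is summarized by a mean and standard deviation taken from forced-alignment durations, and uses a greedy accept/reject criterion on single-frame perturbations.

```python
import random


def shift_durations(pred, real_mean, real_std, steps=2000, seed=0):
    """Randomly perturb integer phoneme durations (in frames) so their
    mean/std drift toward reference statistics from a forced aligner.

    Hypothetical sketch: each step adds +/-1 frame to one phoneme and is
    kept only if it does not increase the distance to the target stats.
    """
    rng = random.Random(seed)
    durs = list(pred)

    def stats(d):
        m = sum(d) / len(d)
        v = sum((x - m) ** 2 for x in d) / len(d)
        return m, v ** 0.5

    def dist(d):
        m, s = stats(d)
        return abs(m - real_mean) + abs(s - real_std)

    best = dist(durs)
    for _ in range(steps):
        i = rng.randrange(len(durs))
        delta = rng.choice((-1, 1))
        if durs[i] + delta < 1:  # keep every phoneme at least one frame long
            continue
        durs[i] += delta
        d = dist(durs)
        if d <= best:
            best = d  # keep the step
        else:
            durs[i] -= delta  # revert the step
    return durs
```

In practice the paper's method would operate on per-phoneme distributions rather than global statistics; the greedy walk above only demonstrates the distribution-shifting principle.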
- Nick Rossenbach
- Benedikt Hilmes
- Ralf Schlüter