Optimized Tokenization for Transcribed Error Correction (2310.10704v1)
Abstract: The challenges facing speech recognition systems, such as variations in pronunciation, adverse audio conditions, and the scarcity of labeled data, emphasize the necessity for a post-processing step that corrects recurring errors. Previous research has shown the advantages of employing dedicated error correction models, yet training such models requires large amounts of labeled data, which is not easily obtained. To overcome this limitation, synthetic transcribed-like data is often utilized; however, bridging the distribution gap between transcribed errors and synthetic noise is not trivial. In this paper, we demonstrate that the performance of correction models can be significantly increased by training solely on synthetic data. Specifically, we empirically show that: (1) synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations; (2) applying language-specific adjustments to the vocabulary of a BPE tokenizer strikes a balance between adapting to unseen distributions and retaining knowledge of transcribed errors. We showcase the benefits of these key observations, and evaluate our approach using multiple languages, speech recognition systems, and prominent speech recognition datasets.
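To make observation (1) concrete, the sketch below illustrates one simple way to derive a word-level error distribution from a small set of (reference, ASR hypothesis) pairs and then reuse it to corrupt clean text into synthetic "transcribed-like" training examples, contrasted with purely random perturbations. This is a minimal illustration under assumed simplifications (position-wise alignment, word-level substitutions only); the function names and parameters are hypothetical and the paper does not prescribe this exact implementation.

```python
import random
from collections import Counter, defaultdict


def derive_error_distribution(pairs):
    """Count word substitutions between aligned reference/hypothesis pairs.

    For simplicity this aligns position-by-position; a real system would
    use an edit-distance alignment to handle insertions and deletions.
    """
    confusions = defaultdict(Counter)
    for reference, hypothesis in pairs:
        for r, h in zip(reference.split(), hypothesis.split()):
            if r != h:
                confusions[r][h] += 1
    return confusions


def corrupt_with_distribution(sentence, confusions, noise_prob=0.15, rng=random):
    """Replace words with errors sampled from the derived ASR-error distribution."""
    out = []
    for word in sentence.split():
        candidates = confusions.get(word)
        if candidates and rng.random() < noise_prob:
            hyps, counts = zip(*candidates.items())
            out.append(rng.choices(hyps, weights=counts, k=1)[0])
        else:
            out.append(word)
    return " ".join(out)


def corrupt_randomly(sentence, noise_prob=0.15, rng=random):
    """Baseline: random character substitutions, unrelated to real ASR behaviour."""
    chars = list(sentence)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < noise_prob:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


if __name__ == "__main__":
    # Toy transcribed data (hypothetical examples, not from the paper).
    transcribed = [
        ("their house is near", "there house is near"),
        ("i want to eat", "i won to eat"),
    ]
    confusions = derive_error_distribution(transcribed)
    clean = "their friends want to eat"
    print(corrupt_with_distribution(clean, confusions))  # ASR-like errors
    print(corrupt_randomly(clean))                       # generic noise
```

The contrast between the two corruption functions mirrors the abstract's claim: noise sampled from an empirically derived error distribution resembles the errors a correction model will actually see at inference time, whereas random perturbations do not.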