Boosting Norwegian Automatic Speech Recognition (2307.01672v1)
Published 4 Jul 2023 in cs.CL
Abstract: In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10% to 7.60%, with models achieving 5.81% for Bokmål and 11.54% for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.