Creating Spoken Dialog Systems in Ultra-Low Resourced Settings (2312.06266v1)
Abstract: Automatic Speech Recognition (ASR) systems are a crucial technology used today to design a wide variety of applications, most notably smart assistants such as Alexa. ASR systems are essentially dialogue systems that employ Spoken Language Understanding (SLU) to extract meaningful information from speech. The main challenge in designing such systems is that they require a huge amount of clean, labeled data to perform competitively. Such data is extremely hard to collect and annotate for the respective SLU tasks; furthermore, when such systems are designed for low-resource languages, where data is extremely limited, the problem intensifies. In this paper, we focus on a popular SLU task, Intent Classification, in a low-resource language, namely Flemish. Intent Classification is the task of understanding the intent of a user interacting with the system. We build on existing lightweight models for intent classification in Flemish, and our main contribution is applying different augmentation techniques on two levels -- the voice level and the phonetic-transcript level -- to counter the scarcity of labeled data in low-resource languages. We find that our data augmentation techniques, on both levels, improve model performance on a number of tasks.
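The abstract does not specify which augmentations were used, but the two levels it names can be illustrated with generic, commonly used techniques. The sketch below is a minimal, hypothetical example: additive noise and speed perturbation as voice-level augmentations, and random token deletion as a phonetic-transcript-level augmentation. Function names and parameters are illustrative assumptions, not the paper's method.

```python
import numpy as np

def add_noise(signal, snr_db=20.0, rng=None):
    """Voice-level augmentation: additive Gaussian noise at a target SNR (dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def speed_perturb(signal, rate=1.1):
    """Voice-level augmentation: change speed by resampling the waveform
    with linear interpolation (rate > 1 shortens, rate < 1 lengthens)."""
    n_out = int(len(signal) / rate)
    old_idx = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(signal)), signal)

def drop_phones(phones, p=0.1, rng=None):
    """Transcript-level augmentation: randomly delete phonetic tokens,
    simulating recognizer noise in the phonetic transcripts."""
    rng = rng or np.random.default_rng(0)
    kept = [ph for ph in phones if rng.random() > p]
    return kept or phones  # never return an empty sequence
```

In a training pipeline, each labeled utterance would yield several augmented copies (waveform variants and perturbed phonetic transcripts) that share the original intent label, multiplying the effective size of the scarce labeled set.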