Multimodal Audio-textual Architecture for Robust Spoken Language Understanding (2306.06819v2)
Abstract: Recent voice assistants are typically based on the cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine and a natural language understanding (NLU) system. Because this approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs), such as BERT and RoBERTa. Moreover, we propose a multimodal language understanding (MLU) module to mitigate the SLU performance degradation caused by errors in the ASR transcript. The MLU benefits from self-supervised features learned from both audio and text modalities, specifically Wav2Vec for speech and BERT/RoBERTa for language. Our MLU combines an encoder network to embed the audio signal and a text encoder to process text transcripts, followed by a late fusion layer that fuses the audio and text logits. We found the proposed MLU to be robust against poor-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely compromised. Our model is evaluated on five tasks from three SLU datasets, and robustness is tested using ASR transcripts from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the PLMs' performance across all datasets for the academic ASR engine.
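To make the late-fusion design concrete, here is a minimal PyTorch sketch of the architecture the abstract describes: a Wav2Vec 2.0 audio encoder and a BERT text encoder, each with its own classification head, with the two logit vectors combined by a late fusion layer. The checkpoint names, pooling choices, and the learnable scalar fusion weight are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal late-fusion MLU sketch (illustrative; not the paper's exact code).
# Assumes HuggingFace `transformers` and `torch`; checkpoint names and the
# scalar fusion weight are our own choices, not taken from the paper.
import torch
import torch.nn as nn
from transformers import BertModel, Wav2Vec2Model

class LateFusionMLU(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Self-supervised encoders, one per modality.
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # One classification head per modality, producing per-class logits.
        self.audio_head = nn.Linear(self.audio_encoder.config.hidden_size, num_classes)
        self.text_head = nn.Linear(self.text_encoder.config.hidden_size, num_classes)
        # Late fusion: a learnable scalar interpolating the two logit vectors.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, waveform, input_ids, attention_mask):
        # Mean-pool wav2vec 2.0 frame features into an utterance embedding.
        audio_feats = self.audio_encoder(waveform).last_hidden_state.mean(dim=1)
        # Use BERT's [CLS] token as the transcript embedding.
        text_feats = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        audio_logits = self.audio_head(audio_feats)
        text_logits = self.text_head(text_feats)
        # Fuse after each branch has classified independently.
        a = torch.sigmoid(self.alpha)
        return a * audio_logits + (1.0 - a) * text_logits
```

Fusing at the logit level, rather than concatenating encoder features, lets either branch dominate when the other modality is unreliable, for instance down-weighting the text branch when the ASR transcript is noisy.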
- Tie your embeddings down: Cross-modal latent spaces for end-to-end spoken language understanding. arXiv preprint arXiv:2011.09044.
- Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
- Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. 2020. SLURP: A spoken language understanding resource package. arXiv preprint arXiv:2011.13205.
- Tessa Bent and Rachael F Holt. 2017. Representation of speech variability. Wiley Interdisciplinary Reviews: Cognitive Science, 8(4):e1434.
- On a simple and efficient approach to probability distribution function aggregation. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(9):2444–2453.
- Spoken language understanding without speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6189–6193. IEEE.
- Claude Coulombe. 2018. Text data augmentation made simple by leveraging NLP cloud APIs. arXiv preprint arXiv:1812.04718.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Google. 2021. Google ASR API. Accessed on 13-August-2021.
- Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
- From audio to semantics: Approaches to end-to-end spoken language understanding. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 720–726. IEEE.
- Chao-Wei Huang and Yun-Nung Chen. 2020. Learning ASR-robust contextualized embeddings for spoken language understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8009–8013. IEEE.
- Leveraging unpaired text data for training end-to-end speech-to-intent systems. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7984–7988. IEEE.
- Veton Këpuska and Gamal Bohouta. 2017. Comparing speech recognition systems (Microsoft API, Google API and CMU Sphinx). Int. J. Eng. Res. Appl, 7(03):20–24.
- OneNet: Joint domain, intent, slot prediction for spoken language understanding. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 547–553. IEEE.
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
- Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio. 2019. Speech model pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670.
- Neural confnet classification: Fully neural network based spoken utterance classification using word confusion networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6039–6043. IEEE.
- Spoken language understanding on the edge. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pages 57–61. IEEE.
- Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128.
- Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
- Towards end-to-end spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5754–5758. IEEE.
- Simulating ASR errors for training SLU systems. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
- ASR error management for improving spoken language understanding. arXiv preprint arXiv:1705.09515.
- WIT. 2021. Build natural language experiences. Accessed on 13-August-2021.
- Robust spoken language understanding with unsupervised ASR-error adaptation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6179–6183. IEEE.
- Anderson R. Avila
- Mehdi Rezagholizadeh
- Chao Xing