Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks (2401.02921v1)
Abstract: In spoken language understanding (SLU), many natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech in place of conventional written text. In real-world scenarios, an automatic speech recognition (ASR) system produces a transcript hypothesis before input to the LLM, and its inherent errors can degrade downstream SLU tasks. Here we introduce a method that uses the ASR system's lattice output instead of relying solely on the top hypothesis, aiming to capture speech ambiguities and improve SLU outcomes. Our in-context learning experiments, covering spoken question answering and intent classification, show that word confusion networks derived from lattices make the LLM resilient to noisy speech transcripts, bridging the SLU performance gap between the top ASR hypothesis and an oracle upper bound. We additionally examine the LLM's robustness under varying ASR performance conditions and identify which aspects of in-context learning prove most influential.
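As a minimal sketch of the idea (the abstract does not specify the paper's exact prompt format, so the serialization scheme, function names, and example intent label below are illustrative assumptions): a word confusion network is a sequence of "bins", each holding alternative words with posterior probabilities, and it can be flattened into text and embedded in a few-shot prompt so the LLM sees the ASR ambiguities rather than only the 1-best path.

```python
# Hypothetical sketch: serialize a word confusion network (WCN) into text
# and build a few-shot in-context prompt for intent classification.
from typing import List, Tuple

# A WCN is a sequence of bins; each bin holds (word, posterior) alternatives.
Bin = List[Tuple[str, float]]

def serialize_wcn(wcn: List[Bin], top_k: int = 3) -> str:
    """Render each bin as 'word1(p1)|word2(p2)', keeping the top_k alternatives."""
    pieces = []
    for bin_ in wcn:
        best = sorted(bin_, key=lambda wp: wp[1], reverse=True)[:top_k]
        pieces.append("|".join(f"{w}({p:.2f})" for w, p in best))
    return " ".join(pieces)

def build_prompt(examples: List[Tuple[List[Bin], str]], query: List[Bin]) -> str:
    """Assemble a few-shot prompt from (WCN, intent) demonstrations plus a query."""
    lines = ["Classify the intent of each noisy transcript."]
    for wcn, intent in examples:
        lines.append(f"Transcript: {serialize_wcn(wcn)}\nIntent: {intent}")
    lines.append(f"Transcript: {serialize_wcn(query)}\nIntent:")
    return "\n\n".join(lines)

if __name__ == "__main__":
    wcn = [[("turn", 0.9), ("learn", 0.1)],
           [("on", 0.7), ("off", 0.3)],
           [("the", 1.0)],
           [("lights", 0.8), ("light", 0.2)]]
    print(build_prompt([(wcn, "iot_lights_on")], wcn))
```

Capping each bin at the top-k alternatives bounds prompt length while still exposing the most probable competing words, which is the ambiguity signal a 1-best transcript discards.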
Authors: Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke