High-precision Voice Search Query Correction via Retrievable Speech-text Embeddings (2401.04235v1)
Abstract: Automatic speech recognition (ASR) systems can suffer from poor recall for various reasons, such as noisy audio or a lack of sufficient training data. Previous work has shown that recall can be improved by retrieving rewrite candidates from a large database of likely, contextually relevant alternatives to the hypothesis text, using nearest-neighbor search over embeddings of both the ASR hypothesis text to be corrected and the candidate corrections. However, hypothesis-based retrieval can yield poor precision if the textual hypotheses are too phonetically dissimilar to the true transcript. In this paper, we eliminate this hypothesis-audio mismatch by querying the correction database directly with embeddings derived from the utterance audio; the embeddings of the utterance audio and of the candidate corrections are produced by multimodal speech-text embedding networks trained to place the embedding of an utterance's audio close to the embedding of its textual transcript. After locating an appropriate correction candidate via nearest-neighbor search, we score the candidate by its speech-text embedding distance before adding it to the original n-best list. We show a relative word error rate (WER) reduction of 6% on utterances whose transcripts appear in the candidate set, without increasing WER on general utterances.
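To make the retrieval-and-scoring flow concrete, here is a minimal Python sketch. `embed_audio` and `embed_text` are hypothetical stand-ins for the paper's multimodal speech-text embedding networks, and the `max_distance` threshold is an assumed tuning parameter, not a value from the paper; a production system would also replace the brute-force scan with an approximate nearest-neighbor index.

```python
# Minimal sketch of the correction flow described in the abstract:
# embed the utterance audio, retrieve the nearest candidate correction
# by speech-text embedding distance, and gate it on that distance
# before appending it to the ASR n-best list.
import numpy as np

def build_candidate_index(candidates, embed_text):
    """Embed and L2-normalize every correction candidate once, offline."""
    embs = np.stack([embed_text(c) for c in candidates])
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

def correct_query(audio, candidates, cand_embs, embed_audio,
                  n_best, max_distance=0.3):
    """Retrieve the nearest candidate to the utterance audio and, if its
    embedding distance is small enough, add it to the n-best list."""
    q = embed_audio(audio)
    q = q / np.linalg.norm(q)
    # Cosine similarity via dot product on normalized vectors; a real
    # system would query an approximate nearest-neighbor index here.
    sims = cand_embs @ q
    best = int(np.argmax(sims))
    distance = 1.0 - float(sims[best])
    if distance <= max_distance:  # precision gate on embedding distance
        return n_best + [(candidates[best], distance)]
    return n_best
```

The precision gate is the key design point: a candidate is only admitted when the audio embedding lies close to the candidate's text embedding, which is what lets the method avoid increasing WER on general utterances.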