kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels (2312.13560v2)
Abstract: The success of retrieval-augmented language models across various NLP tasks has been hard to replicate in automatic speech recognition (ASR) because of the difficulty of constructing fine-grained audio-text datastores. This paper presents kNN-CTC, a novel approach that overcomes this challenge by leveraging Connectionist Temporal Classification (CTC) pseudo labels to establish frame-level audio-text key-value pairs, circumventing the need for precise ground-truth alignments. We further introduce a skip-blank strategy, which ignores CTC blank frames, to reduce datastore size. By incorporating a k-nearest-neighbors retrieval mechanism into pre-trained CTC ASR systems and leveraging this fine-grained, pruned datastore, kNN-CTC achieves consistent and substantial performance improvements across various experimental settings. Our code is available at https://github.com/NKU-HLT/KNN-CTC.
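The abstract describes three moving parts: a frame-level datastore mapping encoder features to CTC pseudo labels, a skip-blank pruning step, and interpolation of kNN retrieval probabilities with the CTC posterior. A minimal sketch of that pipeline, assuming illustrative names and hyperparameters (`BLANK_ID`, `k`, `lam`, `temperature` are not from the paper), might look like:

```python
import numpy as np

# Hedged sketch of the kNN-CTC idea (not the authors' implementation):
# build a frame-level datastore of (encoder feature -> CTC pseudo label)
# pairs while skipping blank frames, then interpolate a kNN-derived label
# distribution with the CTC posterior at inference time.

BLANK_ID = 0  # CTC blank index (assumed convention)

def build_datastore(features, pseudo_labels):
    """Keep only non-blank frames (the skip-blank strategy)."""
    keep = pseudo_labels != BLANK_ID
    return features[keep], pseudo_labels[keep]

def knn_probs(query, keys, values, num_classes, k=4, temperature=1.0):
    """Distance-weighted label distribution over the k nearest stored frames."""
    d2 = np.sum((keys - query) ** 2, axis=1)   # squared L2 distance to each key
    nn = np.argsort(d2)[:k]                    # indices of the k nearest keys
    w = np.exp(-d2[nn] / temperature)          # closer neighbors weigh more
    probs = np.zeros(num_classes)
    for idx, weight in zip(nn, w):
        probs[values[idx]] += weight           # accumulate weight per label
    return probs / probs.sum()

def interpolate(ctc_posterior, knn_posterior, lam=0.5):
    """Final frame-level distribution: lam * kNN + (1 - lam) * CTC."""
    return lam * knn_posterior + (1.0 - lam) * ctc_posterior

# Toy usage: 2-D features, 3 classes (class 0 is the blank).
feats = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 1, 2, 2])                # frame-level pseudo labels
keys, values = build_datastore(feats, labels)  # the blank frame is dropped
p_knn = knn_probs(np.array([5.05, 5.0]), keys, values, num_classes=3, k=2)
p_ctc = np.array([0.2, 0.3, 0.5])
p = interpolate(p_ctc, p_knn, lam=0.5)
```

In practice the paper uses a large-scale similarity-search index (FAISS is cited in its references) rather than the brute-force distance computation shown here; the brute-force version is used only to keep the sketch self-contained.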
Authors: Jiaming Zhou, Shiwan Zhao, Yaqi Liu, Wenjia Zeng, Yong Chen, Yong Qin