kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels (2312.13560v2)

Published 21 Dec 2023 in cs.SD and eess.AS

Abstract: The success of retrieval-augmented LLMs in various NLP tasks has been constrained in automatic speech recognition (ASR) applications due to challenges in constructing fine-grained audio-text datastores. This paper presents kNN-CTC, a novel approach that overcomes these challenges by leveraging Connectionist Temporal Classification (CTC) pseudo labels to establish frame-level audio-text key-value pairs, circumventing the need for precise ground-truth alignments. We further introduce a skip-blank strategy, which strategically ignores CTC blank frames during datastore construction to reduce the datastore size. By incorporating a k-nearest-neighbors retrieval mechanism into pre-trained CTC ASR systems and leveraging this fine-grained, pruned datastore, kNN-CTC consistently achieves substantial performance improvements under various experimental settings. Our code is available at https://github.com/NKU-HLT/KNN-CTC.
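To make the mechanism concrete, the sketch below illustrates the general kNN-CTC idea as described in the abstract: build a frame-level datastore whose keys are encoder features and whose values are CTC pseudo labels (skipping blank frames), then interpolate the CTC posteriors with a distribution formed from retrieved neighbors at decoding time. This is a minimal illustration, not the paper's implementation: the `encode` and `ctc_posteriors` interfaces, the blank index, and the lambda/temperature values are assumptions for the example.

```python
# Minimal sketch of a kNN-CTC-style datastore and retrieval step.
# Hypothetical interfaces (not from the paper's repo):
#   encode(utt)            -> (T, D) float32 frame-level encoder features
#   ctc_posteriors(feats)  -> (T, V) per-frame label probabilities from a
#                             pre-trained CTC model
import numpy as np
import faiss

BLANK_ID = 0          # CTC blank index (assumed)
LAMBDA = 0.3          # interpolation weight for the kNN distribution (assumed)
TEMPERATURE = 1.0     # temperature over kNN distances (assumed)


def build_datastore(utterances, encode, ctc_posteriors):
    """Collect frame-level (feature, pseudo-label) pairs, skipping blank frames."""
    keys, values = [], []
    for utt in utterances:
        feats = encode(utt)                        # (T, D)
        post = ctc_posteriors(feats)               # (T, V)
        pseudo = post.argmax(axis=-1)              # frame-level CTC pseudo labels
        keep = pseudo != BLANK_ID                  # "skip-blank": drop blank frames
        keys.append(feats[keep])
        values.append(pseudo[keep])
    keys = np.concatenate(keys).astype("float32")
    values = np.concatenate(values)
    index = faiss.IndexFlatL2(int(keys.shape[1]))  # exact L2 index for simplicity
    index.add(keys)
    return index, values


def knn_augmented_posteriors(feats, ctc_posteriors, index, values, k=8):
    """Interpolate CTC posteriors with a distance-weighted kNN label distribution."""
    post = ctc_posteriors(feats)                   # (T, V)
    dists, idx = index.search(feats.astype("float32"), k)
    weights = np.exp(-dists / TEMPERATURE)         # closer neighbors count more
    weights /= weights.sum(axis=1, keepdims=True)
    knn_dist = np.zeros_like(post)
    for t in range(feats.shape[0]):
        for j in range(k):
            knn_dist[t, values[idx[t, j]]] += weights[t, j]
    return (1.0 - LAMBDA) * post + LAMBDA * knn_dist
```

The skip-blank filter matters because CTC models emit the blank label at most frames, so dropping those frames prunes the datastore substantially while keeping the label-bearing keys that retrieval actually needs.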

Authors (6)
  1. Jiaming Zhou (41 papers)
  2. Shiwan Zhao (47 papers)
  3. Yaqi Liu (18 papers)
  4. Wenjia Zeng (5 papers)
  5. Yong Chen (299 papers)
  6. Yong Qin (35 papers)
Citations (7)

