Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models (2407.01909v1)

Published 2 Jul 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Recent studies have demonstrated the efficacy of LLMs in error correction for automatic speech recognition (ASR). However, much of the research focuses on the English language. This paper shifts the focus to Chinese. Firstly, we construct a specialized benchmark dataset aimed at error correction for Chinese ASR with 724K hypothesis-transcription pairs, named the Chinese Hypotheses Paradise dataset (ChineseHP), which covers a wide range of scenarios and presents significant challenges. Subsequently, we conduct a preliminary evaluation on the dataset of both direct prompting and fine-tuning of pre-trained LLMs. Furthermore, we propose a straightforward method of Pinyin regularization for prompts, which derives the Pinyin transcription directly from the text hypotheses. The experimental results reveal that Pinyin regularization consistently enhances the error-correcting ability of LLMs compared with prompts without regularization. The dataset is available on the website.
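As described in the abstract, Pinyin regularization amounts to prompting the LLM with the ASR hypothesis together with its Pinyin rendering, giving the model explicit phonetic evidence for repairing homophone or near-homophone substitution errors. The sketch below illustrates how such a prompt might be assembled; it assumes the pypinyin package and an illustrative prompt template, neither of which is specified by the paper, so it is a rough approximation rather than the authors' implementation.

```python
# Minimal sketch of Pinyin-regularized prompting for Chinese ASR error correction.
# Assumption: the pypinyin package is used for Pinyin conversion and the prompt
# wording is illustrative; the paper does not prescribe either.
from pypinyin import lazy_pinyin, Style


def build_prompt(hypothesis: str) -> str:
    """Augment an ASR hypothesis with its Pinyin transcription before
    asking an LLM to produce a corrected transcription."""
    # Convert each character to tone-numbered Pinyin, e.g. "语音" -> "yu3 yin1".
    pinyin = " ".join(lazy_pinyin(hypothesis, style=Style.TONE3))
    return (
        "Correct the following Chinese ASR hypothesis.\n"
        f"Hypothesis: {hypothesis}\n"
        f"Pinyin: {pinyin}\n"
        "Corrected transcription:"
    )


if __name__ == "__main__":
    # Example: the prompt carries both the characters and their pronunciation.
    print(build_prompt("今天天气怎么样"))
```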

Authors (4)
  1. Zhiyuan Tang (34 papers)
  2. Dong Wang (628 papers)
  3. Shen Huang (25 papers)
  4. Shidong Shang (10 papers)