Timestamped Embedding-Matching Acoustic-to-Word CTC ASR (2306.11473v1)

Published 20 Jun 2023 in cs.CL and eess.AS

Abstract: In this work, we describe a novel method of training an embedding-matching word-level connectionist temporal classification (CTC) automatic speech recognizer (ASR) such that it directly produces word start times and durations, required by many real-world applications, in addition to the transcription. The word timestamps enable the ASR to output word segmentations and word confusion networks without relying on a secondary model or forced alignment process at test time. Our proposed system achieves word segmentation accuracy similar to that of a hybrid DNN-HMM (Deep Neural Network-Hidden Markov Model) system, with less than 3 ms difference in mean absolute error in word start times on TIMIT data. At the same time, we observed less than a 5% relative increase in word error rate compared to the non-timestamped system when using the same audio training data and a nearly identical model size. We also contribute a more rigorous analysis of multiple-hypothesis embedding-matching ASR in general.
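To make the timestamping idea concrete, the sketch below shows a minimal way to read word start times and durations off a greedy CTC decode by counting consecutive frames per emitted label. This is an illustration of the general mechanism, not the paper's method; the vocabulary, blank index, and 30 ms frame duration are placeholder assumptions.

```python
import numpy as np

def greedy_ctc_with_timestamps(posteriors, vocab, blank=0, frame_ms=30.0):
    """Collapse a frame-level CTC posterior grid into (word, start_ms, dur_ms).

    posteriors: (T, V) array of per-frame label probabilities.
    vocab: list mapping label index -> word string (index `blank` is CTC blank).
    frame_ms: assumed duration of one acoustic frame in milliseconds.
    """
    best = posteriors.argmax(axis=1)          # greedy per-frame label choice
    words = []
    prev = blank
    for t, lab in enumerate(best):
        if lab != blank and lab != prev:      # a new word label starts at frame t
            words.append([vocab[lab], t * frame_ms, frame_ms])
        elif lab != blank and lab == prev:    # same word label continues
            words[-1][2] += frame_ms
        prev = lab
    return [(w, start, dur) for w, start, dur in words]

# Hypothetical 8-frame utterance over a 3-entry vocabulary (index 0 = blank):
labels = [0, 1, 1, 0, 2, 2, 2, 0]
post = np.eye(3)[labels]                      # one-hot posteriors for the demo
print(greedy_ctc_with_timestamps(post, ["<b>", "hello", "world"], frame_ms=10.0))
# → [('hello', 10.0, 20.0), ('world', 40.0, 30.0)]
```

Note the caveat that motivates the paper: a standard CTC model tends to emit peaky label spikes that can lag the true acoustic onset, so frame-counting over a vanilla decode gives unreliable boundaries; the proposed training instead teaches the model to emit timestamps aligned with actual word start times.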

