Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition (2401.02417v1)

Published 4 Jan 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: While word error rates of automatic speech recognition (ASR) systems have consistently fallen, natural language understanding (NLU) applications built on top of ASR systems still attribute a significant share of their failures to low-quality speech recognition results. Existing assistant systems collect large numbers of these unsuccessful interactions but usually fail to learn from them, even in an offline fashion. In this work, we introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion, making use of easily detectable artifacts in unsuccessful conversations with assistants. We demonstrate that our CLC family of approaches can improve the performance of ASR models on OD3, a new public large-scale semi-synthetic meta-dataset of audio task-oriented dialogues, by up to 19.2%. These gains transfer to real-world systems as well, where we show that CLC improves performance by up to 6.7% over baselines. We make OD3 publicly available at https://github.com/amazon-science/amazon-od3.
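
The abstract does not spell out the training objective, but contrastive fine-tuning on pairs mined from conversation logs is commonly realized as an InfoNCE-style loss over utterance embeddings. Below is a minimal sketch under one assumed "easily detectable artifact": a failed request followed by the user's rephrase, which forms a positive pair, while the other utterances in the batch serve as negatives. The pair-mining heuristic, the mean pooling over encoder frames, the temperature value, and the name `clc_style_contrastive_loss` are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def clc_style_contrastive_loss(anchor_emb: torch.Tensor,
                               positive_emb: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over pooled utterance embeddings.

    anchor_emb:   (B, D) embeddings of the original (failed) utterances
    positive_emb: (B, D) embeddings of their detected rephrases
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # Row i's positive sits on the diagonal; off-diagonal entries act as
    # in-batch negatives.
    return F.cross_entropy(logits, targets)

# Usage sketch: mean-pool an ASR encoder's frame outputs into utterance
# embeddings, then combine this term with the usual ASR objective.
B, T, D = 8, 200, 256
anchor_frames = torch.randn(B, T, D)        # stand-in for encoder outputs
positive_frames = torch.randn(B, T, D)
loss = clc_style_contrastive_loss(anchor_frames.mean(dim=1),
                                  positive_frames.mean(dim=1))
```

In practice such a contrastive term would be mixed with the standard ASR training loss (e.g., CTC or sequence-to-sequence cross-entropy) during fine-tuning rather than used alone.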
