
Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding (2405.15097v1)

Published 23 May 2024 in cs.CL and cs.AI

Abstract: Recently, deep end-to-end learning has been studied for intent classification in Spoken Language Understanding (SLU). However, end-to-end models require a large amount of speech data with intent labels, and highly optimized models are generally sensitive to the inconsistency between the training and evaluation conditions. Therefore, a natural language understanding approach based on Automatic Speech Recognition (ASR) remains attractive because it can utilize a pre-trained general language model and adapt to the mismatch of the speech input environment. Using this module-based approach, we improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors. We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. Experiments on four benchmark datasets show that CCL outperforms existing methods and improves the ASR robustness in various noisy environments. Code is available at https://github.com/syoung7388/CCL.
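To make the abstract's idea concrete, the sketch below shows one generic way a contrastive term (aligning each clean transcript with its own noisy ASR counterpart against other pairs in the batch) can be combined with a consistency term (pulling the paired latent features together). This is a minimal illustration assuming paired sentence-level embeddings from a shared text encoder; the function and argument names (ccl_losses, temperature, consistency_weight) are illustrative and this InfoNCE-plus-MSE formulation is not the authors' exact two-stage method, for which see the linked repository.

```python
import torch
import torch.nn.functional as F

def ccl_losses(clean_emb, noisy_emb, temperature=0.1, consistency_weight=1.0):
    """Sketch of a combined contrastive + consistency objective.

    clean_emb, noisy_emb: (batch, dim) latent features of a clean transcript
    and its noisy ASR counterpart, produced by a shared encoder.
    """
    # L2-normalize so dot products become cosine similarities.
    clean = F.normalize(clean_emb, dim=-1)
    noisy = F.normalize(noisy_emb, dim=-1)

    # Contrastive (InfoNCE-style) term: each clean transcript should be most
    # similar to its own noisy version among all noisy transcripts in the batch.
    logits = clean @ noisy.t() / temperature            # (batch, batch)
    targets = torch.arange(clean.size(0), device=clean.device)
    contrastive = F.cross_entropy(logits, targets)

    # Consistency term: directly penalize the distance between the paired
    # latent features so downstream intent prediction sees similar inputs.
    consistency = F.mse_loss(clean_emb, noisy_emb)

    return contrastive + consistency_weight * consistency
```

In practice the relative weight of the two terms would be tuned on a development set; the key intuition is that the contrastive term teaches the encoder which error patterns map back to which clean utterance, while the consistency term keeps the representations used by the classifier stable under ASR noise.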

Authors (3)
  1. Suyoung Kim (11 papers)
  2. Jiyeon Hwang (1 paper)
  3. Ho-Young Jung (7 papers)

