
Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding (2310.06103v1)

Published 9 Oct 2023 in cs.CL, cs.SD, and eess.AS

Abstract: A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models; however, their evaluation often lacks a multilingual setup and tasks that require the prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four languages in a generative manner, including the prediction of lexical fillers. We investigate how the proposed method can be improved by pretraining on widely available speech recognition data using several training objectives. Pretraining on 7000 hours of multilingual data ultimately allows us to outperform the state of the art on two SLU datasets and partially on two more. Finally, we examine the cross-lingual capabilities of the proposed model and improve on the best known result on the PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate of 23.65%.
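The core recipe described in the abstract, coupling a multilingual self-supervised speech encoder with a multilingual pretrained text decoder so that intents, slots, and their lexical fillers are emitted as an ordinary token sequence, can be sketched with off-the-shelf components. Below is a minimal, hypothetical sketch using Hugging Face Transformers. The XLS-R and mBART-50 checkpoint names and the linearized target format are illustrative assumptions, not the authors' exact setup; the paper additionally pretrains the combined model on ASR data, which is omitted here.

```python
# Minimal sketch (not the authors' code): wiring a multilingual
# self-supervised speech encoder to a multilingual text decoder
# for generative E2E-SLU. Checkpoints and target format are assumptions.
import numpy as np
from transformers import (
    AutoFeatureExtractor,
    AutoTokenizer,
    SpeechEncoderDecoderModel,
)

encoder_id = "facebook/wav2vec2-xls-r-300m"  # XLS-R speech encoder
decoder_id = "facebook/mbart-large-50"       # mBART-50 text decoder

feature_extractor = AutoFeatureExtractor.from_pretrained(encoder_id)
tokenizer = AutoTokenizer.from_pretrained(decoder_id)

# The cross-attention weights bridging the two pretrained stacks are
# randomly initialized and must be learned during SLU fine-tuning.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_id, decoder_id
)
# mBART conventionally starts decoding with a target-language code token.
model.config.decoder_start_token_id = tokenizer.lang_code_to_id["en_XX"]
model.config.pad_token_id = tokenizer.pad_token_id

# One training step. The target is a hypothetical serialization of an
# SLU frame: slots are predicted together with their lexical fillers.
waveform = np.random.randn(16000).astype(np.float32)  # 1 s of 16 kHz audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
labels = tokenizer(
    "intent: set_alarm | time: seven am", return_tensors="pt"
).input_ids

loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()

# At inference, the same model decodes the frame autoregressively:
# model.generate(inputs.input_values, max_new_tokens=32)
```

Generating the frame as free text, rather than classifying over a fixed label set, is what lets a single model handle slot filling with open-vocabulary fillers across all six datasets and four languages.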

Authors (2)
  1. Pavel Denisov (19 papers)
  2. Ngoc Thang Vu (93 papers)
Citations (1)