Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding (2310.06103v1)
Abstract: A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models; however, their evaluation often lacks a multilingual setup and tasks that require the prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four languages in a generative manner, including the prediction of lexical fillers. We investigate how the proposed method can be improved by pretraining on widely available speech recognition data using several training objectives. Pretraining on 7,000 hours of multilingual data allows us to outperform the state of the art fully on two SLU datasets and partially on two more. Finally, we examine the cross-lingual capabilities of the proposed model and improve on the best known result on the PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate of 23.65%.
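The core pipeline the abstract describes, a multilingual self-supervised speech encoder whose outputs are fed to a multilingual pretrained text decoder that generates the semantic annotation (intent, slots, and lexical fillers) as an ordinary token sequence, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the checkpoints (XLS-R 300M, mBART-50), the linear bridge, and the serialized annotation format are all assumptions.

```python
import torch
from transformers import (MBart50TokenizerFast, MBartForConditionalGeneration,
                          Wav2Vec2Model)
from transformers.modeling_outputs import BaseModelOutput

# Multilingual SSL speech encoder and multilingual text seq2seq model
# (checkpoint choices are assumptions, picked as common multilingual options).
speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
text_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50",
                                                 tgt_lang="en_XX")

# Project speech features into the text model's embedding space so the
# mBART decoder can cross-attend to them in place of text encoder states.
bridge = torch.nn.Linear(speech_encoder.config.hidden_size,
                         text_model.config.d_model)

waveform = torch.randn(1, 16000)  # dummy batch: 1 second of 16 kHz audio
speech_states = bridge(speech_encoder(waveform).last_hidden_state)

# The target is a serialized SLU annotation; this exact serialization
# (intent plus slot/filler pairs) is a hypothetical example.
labels = tokenizer(text_target="intent=set_alarm [time: seven a.m.]",
                   return_tensors="pt").input_ids

# Skip mBART's own text encoder by passing the speech states directly.
out = text_model(encoder_outputs=BaseModelOutput(last_hidden_state=speech_states),
                 labels=labels)
out.loss.backward()  # fine-tune end to end on (speech, annotation) pairs
```

Because the decoder simply generates text, the same architecture can first be pretrained on plain speech recognition data (transcripts as targets) and then fine-tuned on SLU annotations, which is the pretraining strategy the abstract evaluates.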
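The headline metric, Concept/Value Error Rate (CVER), compares hypothesized (concept, value) pairs against the reference annotation. Below is a minimal sketch, assuming a plain Levenshtein alignment over pairs and ignoring the value-normalization details of the official MEDIA/PortMEDIA scoring tools.

```python
def edit_distance(ref, hyp):
    """Standard dynamic-programming Levenshtein distance over two sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)]

def cver(references, hypotheses):
    """CVER: pair-level edit errors divided by the number of reference pairs."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / sum(len(r) for r in references)

# One utterance with two (concept, value) pairs; the hypothesis gets one value wrong.
ref = [[("command", "reservation"), ("date", "march third")]]
hyp = [[("command", "reservation"), ("date", "march thirty")]]
print(f"CVER = {cver(ref, hyp):.2%}")  # -> CVER = 50.00%
```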
- Pavel Denisov
- Ngoc Thang Vu