
Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations (2403.08738v1)

Published 13 Mar 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Acoustic word embeddings (AWEs) are vector representations of spoken words. An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE). In the past, the CAE method has been used with traditional MFCC features. Representations obtained from self-supervised learning (SSL)-based speech models such as HuBERT and Wav2vec2 outperform MFCCs in many downstream tasks, but they have not been well studied in the context of learning AWEs. This work explores the effectiveness of the CAE with SSL-based speech representations for obtaining improved AWEs. Additionally, the capabilities of SSL-based speech models are explored in cross-lingual scenarios for obtaining AWEs. Experiments are conducted on five languages: Polish, Portuguese, Spanish, French, and English. The HuBERT-based CAE model achieves the best word-discrimination results in all languages, despite HuBERT being pre-trained on English only. The HuBERT-based CAE model also works well in cross-lingual settings: when trained on one source language and tested on target languages, it outperforms MFCC-based CAE models trained on the target languages themselves.
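The correspondence training idea described above can be sketched as follows: an encoder pools a variable-length sequence of frame features (MFCCs or SSL-model outputs such as HuBERT activations) into a fixed-size acoustic word embedding, and a decoder is trained to reconstruct a *different* spoken instance of the same word. This is a minimal illustrative sketch, not the paper's implementation; the layer sizes, GRU architecture, and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class CorrespondenceAutoEncoder(nn.Module):
    """Minimal correspondence auto-encoder (CAE) sketch for AWEs.

    The encoder summarizes a (frames, feat_dim) segment into a single
    embedding vector; the decoder unrolls that embedding back into a
    frame sequence. Hyperparameters here are illustrative only.
    """

    def __init__(self, feat_dim=39, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_embedding = nn.Linear(hidden_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_frames = nn.Linear(hidden_dim, feat_dim)

    def embed(self, x):
        # x: (batch, frames, feat_dim) -> AWE of shape (batch, embed_dim)
        _, h = self.encoder(x)
        return self.to_embedding(h[-1])

    def forward(self, x, target_len):
        # Repeat the embedding for each output frame, then decode.
        z = self.embed(x)
        z_seq = z.unsqueeze(1).expand(-1, target_len, -1)
        out, _ = self.decoder(z_seq)
        return self.to_frames(out)


# One correspondence training step: reconstruct instance B of a word
# from instance A of the same word. Random tensors stand in for real
# frame features here.
model = CorrespondenceAutoEncoder()
inst_a = torch.randn(4, 50, 39)   # 4 word segments, 50 frames each
inst_b = torch.randn(4, 60, 39)   # same words, different durations
recon = model(inst_a, target_len=60)
loss = nn.functional.mse_loss(recon, inst_b)
loss.backward()                   # gradients for an optimizer step
```

After training, `model.embed` alone produces the AWEs used for word discrimination, e.g. by comparing cosine distances between embeddings of same-word and different-word pairs.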

Authors (2)
  1. Amit Meghanani (5 papers)
  2. Thomas Hain (58 papers)
Citations (1)
