Multilingual acoustic word embeddings for zero-resource languages (2401.10543v2)

Published 19 Jan 2024 in eess.AS, cs.CL, and cs.SD

Abstract: This research addresses the challenge of developing speech applications for zero-resource languages that lack labelled data. It specifically uses acoustic word embeddings (AWEs) -- fixed-dimensional representations of variable-duration speech segments -- employing multilingual transfer, where labelled data from several well-resourced languages are used for pretraining. The study introduces a new neural network that outperforms existing AWE models on zero-resource languages, and examines how the choice of well-resourced training languages affects performance. AWEs are applied to a keyword-spotting system for hate speech detection in Swahili radio broadcasts, demonstrating robustness in real-world scenarios. Additionally, novel semantic AWE models improve semantic query-by-example search.
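To make the AWE idea concrete: an embedding function maps a variable-duration sequence of speech features (e.g., MFCC frames) to a single fixed-dimensional vector, so two segments can be compared with one vector distance instead of dynamic time warping. The sketch below is a minimal illustration using the uniform-downsampling baseline common in the AWE literature, not the paper's neural model; the function names and the random stand-in "speech" data are hypothetical.

```python
import numpy as np

def downsample_embed(features: np.ndarray, n_keep: int = 10) -> np.ndarray:
    """Map a variable-length (T, D) feature sequence to a fixed
    (n_keep * D)-dimensional vector by uniformly sampling n_keep frames.
    This is the classic downsampling AWE baseline, not the neural
    model proposed in the paper."""
    T, _ = features.shape
    idx = np.linspace(0, T - 1, n_keep).round().astype(int)
    return features[idx].ravel()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Toy query-by-example search: embed a spoken query once, then rank
# candidate segments by a single vector comparison each (no DTW).
# Random arrays stand in for MFCC sequences of different durations.
rng = np.random.default_rng(0)
query = downsample_embed(rng.normal(size=(63, 13)))  # 63 frames, 13 MFCCs
candidates = [rng.normal(size=(t, 13)) for t in (30, 58, 91, 120)]
scores = [cosine(query, downsample_embed(c)) for c in candidates]
best = int(np.argmax(scores))
print(f"best match: segment {best}, score {scores[best]:.3f}")
```

A trained multilingual encoder, such as the one studied in the paper, would slot in where downsample_embed is used, leaving the search recipe unchanged.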

Authors (1)
  1. Christiaan Jacobs (7 papers)
