Multilingual acoustic word embeddings for zero-resource languages (2401.10543v2)
Abstract: This research addresses the challenge of developing speech applications for zero-resource languages that lack labelled data. It uses acoustic word embeddings (AWEs) -- fixed-dimensional representations of variable-duration speech segments -- obtained through multilingual transfer, in which labelled data from several well-resourced languages are used for pre-training. The study introduces a new neural network that outperforms existing AWE models on zero-resource languages, and examines how the choice of well-resourced languages affects performance. AWEs are then applied in a keyword-spotting system for hate speech detection in Swahili radio broadcasts, demonstrating robustness in real-world scenarios. Finally, novel semantic AWE models improve semantic query-by-example search.
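To make the AWE idea concrete, below is a minimal sketch (in PyTorch, and not the specific model proposed in this work) of how a recurrent encoder can map variable-duration sequences of acoustic frames to fixed-dimensional embeddings, so that two spoken segments are compared with a single cosine similarity instead of frame-wise dynamic time warping. The names and dimensions (`AweEncoder`, 13-dimensional MFCC-like features, a 128-unit GRU) are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AweEncoder(nn.Module):
    """Encodes a variable-length sequence of acoustic frames into one vector."""

    def __init__(self, n_feats: int = 13, embed_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_feats, embed_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames, n_feats); n_frames may differ per segment.
        _, h = self.rnn(frames)            # h: (num_layers, batch, embed_dim)
        return F.normalize(h[-1], dim=-1)  # unit-length, fixed-dimensional AWE

encoder = AweEncoder()
seg_a = torch.randn(1, 50, 13)  # 50-frame segment (placeholder features)
seg_b = torch.randn(1, 80, 13)  # 80-frame segment of a different duration
emb_a, emb_b = encoder(seg_a), encoder(seg_b)
print(emb_a.shape)                                   # torch.Size([1, 128])
print(torch.cosine_similarity(emb_a, emb_b).item())  # scalar similarity
```

In a query-by-example setting, the embedding of a spoken query would be compared against embeddings of candidate segments from the search collection, with detections flagged above a similarity threshold.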
- L. Besacier, E. Barnard, A. Karpov, and T. Schultz, “Automatic speech recognition for under-resourced languages: A survey,” Speech Commun., vol. 56, pp. 85–100, 2014.
- D. M. Eberhard, G. F. Simons, and C. D. Fennig, “Ethnologue: Languages of the world,” 2021. [Online]. Available: https://www.ethnologue.com
- M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, “The Zero Resource Speech Challenge 2015: Proposed approaches and results,” in Proc. SLTU, 2016.
- A. Jansen, E. Dupoux, S. Goldwater, M. Johnson, S. Khudanpur, K. Church, N. Feldman, H. Hermansky, F. Metze, R. Rose, M. Seltzer, P. Clark, I. McGraw, B. Varadarajan, E. Bennett, B. Borschinger, J. Chiu, E. Dunbar, A. Fourtassi, D. Harwath, C.-y. Lee, K. Levin, A. Norouzian, V. Peddinti, R. Richardson, T. Schatz, and S. Thomas, “A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition,” in Proc. ICASSP, 2013.
- O. Räsänen, “Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions,” Speech Commun., vol. 54, no. 9, pp. 975–997, 2012.
- K. Levin, A. Jansen, and B. Van Durme, “Segmental acoustic indexing for zero resource keyword search,” in Proc. ICASSP, 2015.
- S.-F. Huang, Y.-C. Chen, H.-y. Lee, and L.-s. Lee, “Improved audio embeddings by adjacency-based clustering with applications in spoken term detection,” arXiv preprint arXiv:1811.02775, 2018.
- Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Learning acoustic word embeddings with temporal context for query-by-example speech search,” in Proc. Interspeech, 2018.
- A. S. Park and J. R. Glass, “Unsupervised pattern discovery in speech,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 1, pp. 186–197, 2008.
- A. Jansen and B. Van Durme, “Efficient spoken term discovery using randomized algorithms,” in Proc. ASRU, 2011.
- L. Ondel, H. K. Vydana, L. Burget, and J. Černocký, “Bayesian subspace hidden Markov model for acoustic unit discovery,” in Proc. Interspeech, 2019.
- O. Räsänen and M. A. C. Blandón, “Unsupervised discovery of recurring speech patterns using probabilistic adaptive metrics,” arXiv preprint arXiv:2008.00731, 2020.
- H. Kamper, K. Livescu, and S. Goldwater, “An embedded segmental K-means model for unsupervised segmentation and clustering of speech,” in Proc. ASRU, 2017.
- S. Seshadri and O. Räsänen, “SylNet: An adaptable end-to-end syllable count estimator for speech,” IEEE Signal Process. Letters, vol. 26, no. 9, pp. 1359–1363, 2019.
- F. Kreuk, J. Keshet, and Y. Adi, “Self-supervised contrastive learning for unsupervised phoneme segmentation,” in Proc. Interspeech, 2020.
- L. Rabiner, A. Rosenberg, and S. Levinson, “Considerations in dynamic time warping algorithms for discrete word recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 6, pp. 575–582, 1978.
- K. Levin, K. Henry, A. Jansen, and K. Livescu, “Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings,” in Proc. ASRU, 2013.
- N. Holzenberger, M. Du, J. Karadayi, R. Riad, and E. Dupoux, “Learning word embeddings: Unsupervised methods for fixed-size representations of variable-length speech segments,” in Proc. Interspeech, 2018.
- Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” in Proc. Interspeech, 2016.
- H. Kamper, W. Wang, and K. Livescu, “Deep convolutional acoustic word embeddings using word-pair side information,” in Proc. ICASSP, 2016.
- H. Kamper, “Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models,” in Proc. ICASSP, 2019.
- H. Kamper, Y. Matusevych, and S. Goldwater, “Multilingual acoustic word embedding models for processing zero-resource languages,” in Proc. ICASSP, 2020.
- ——, “Improved acoustic word embeddings for zero-resource languages using multilingual transfer,” IEEE Trans. Audio, Speech, Language Process., vol. 29, pp. 1107–1118, 2021.
- Y. Hu, S. Settle, and K. Livescu, “Acoustic span embeddings for multilingual query-by-example search,” in Proc. SLT, 2021.
- ——, “Multilingual jointly trained acoustic and written word embeddings,” in Proc. Interspeech, 2020.
- S. Ruder, “Neural transfer learning for natural language processing,” PhD diss., NUI Galway, 2019.
- D. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. NeurIPS, 2013.
- J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proc. EMNLP, 2014.
- C. Jacobs, Y. Matusevych, and H. Kamper, “Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation,” in Proc. SLT, 2021.
- G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Proc. ICASSP, 2015.
- S. Settle and K. Livescu, “Discriminative acoustic word embeddings: Recurrent neural network-based approaches,” in Proc. SLT, 2016.
- C. Doersch and A. Zisserman, “Multi-task self-supervised visual learning,” in Proc. ICCV, 2017.
- Y. M. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in Proc. ICLR, 2020.
- C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proc. ICCV, 2015.
- M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Proc. ECCV, 2016.
- S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in Proc. ICLR, 2018.
- S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, “Learning problem-agnostic speech representations from multiple self-supervised tasks,” in Proc. Interspeech, 2019.
- G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert, “End-to-end ASR: From supervised to semi-supervised learning with modern architectures,” in Proc. ICML, 2020.
- A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” in Proc. ICLR, 2020.
- A. Baevski, M. Auli, and A. Mohamed, “Effectiveness of self-supervised pre-training for speech recognition,” in Proc. ICASSP, 2020.
- W. Wang, Q. Tang, and K. Livescu, “Unsupervised pre-training of bidirectional speech encoders via masked reconstruction,” in Proc. ICASSP, 2020.
- T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. ICML, 2020.
- K. Sohn, “Improved deep metric learning with multi-class N-pair loss objective,” in Proc. NeurIPS, 2016.
- R. van der Merwe, “Triplet entropy loss: Improving the generalisation of short speech language identification systems,” arXiv preprint arXiv:2012.03775, 2020.
- J. Yi, J. Tao, Z. Wen, and Y. Bai, “Language-adversarial transfer learning for low-resource speech recognition,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 3, pp. 621–630, 2019.
- E. van der Westhuizen, T. Padhi, and T. Niesler, “Multilingual training set selection for ASR in under-resourced Malian languages,” in Proc. SPECOM, 2021.
- T. J. Hazen, W. Shen, and C. White, “Query-by-example spoken term detection using phonetic posteriorgram templates,” in Proc. ASRU, 2009.
- Y. Zhang and J. R. Glass, “Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams,” in Proc. ASRU, 2009.
- A. Jansen and B. Van Durme, “Indexing raw acoustic features for scalable zero resource search,” in Proc. Interspeech, 2012.
- A. Anastasopoulos, D. Chiang, and L. Duong, “An unsupervised probability model for speech-to-translation alignment of low-resource languages,” in Proc. EMNLP, 2016.
- S. Settle, K. Levin, H. Kamper, and K. Livescu, “Query-by-example search with discriminative neural acoustic word embeddings,” in Proc. Interspeech, 2017.
- Y. Yuan, C.-C. Leung, L. Xie, H. Chen, and B. Ma, “Query-by-example speech search using recurrent neural acoustic word embeddings with temporal context,” IEEE Access, vol. 7, pp. 67 656–67 665, 2019.
- D. Ram, L. Miculicich, and H. Bourlard, “Neural network based end-to-end query by example spoken term detection,” IEEE Trans. Audio, Speech, Lang. Process., vol. 28, no. 1, pp. 1416–1427, 2019.
- A. Saeb, R. Menon, H. Cameron, W. Kibira, J. Quinn, and T. Niesler, “Very low resource radio browsing for agile developmental and humanitarian monitoring,” in Proc. Interspeech, 2017.
- R. Menon, A. Saeb, H. Cameron, W. Kibira, J. Quinn, and T. Niesler, “Radio-browsing for developmental monitoring in Uganda,” in Proc. ICASSP, 2017.
- R. Menon, H. Kamper, J. Quinn, and T. Niesler, “Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring,” in Proc. Interspeech, 2018.
- R. Menon, H. Kamper, E. van der Westhuizen, J. Quinn, and T. Niesler, “Feature exploration for almost zero-resource ASR-free keyword spotting using a multilingual bottleneck extractor and correspondence autoencoders,” in Proc. Interspeech, 2019.
- “United Nations strategy and plan of action on hate speech.” [Online]. Available: https://www.un.org/en/genocideprevention/documents/advising-and-mobilizing/Action_plan_on_hate_speech_EN.pdf
- M. Larson and G. Jones, “Spoken content retrieval: A survey of techniques and technologies,” Found. Trends Inform. Retrieval, pp. 235–422, 2012.
- A. Mandal, K. R. Prasanna Kumar, and P. Mitra, “Recent developments in spoken term detection: A survey,” Int. J. of Speech Technol., vol. 17, pp. 183–198, 2014.
- E. van der Westhuizen, H. Kamper, R. Menon, J. Quinn, and T. Niesler, “Feature learning for efficient ASR-free keyword spotting in low-resource languages,” Comput. Speech Lang., vol. 71, p. 101275, 2022.
- Y.-A. Chung and J. Glass, “Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech,” in Proc. Interspeech, 2018.
- Y.-C. Chen, S.-F. Huang, C.-H. Shen, H.-y. Lee, and L.-s. Lee, “Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval,” in Proc. SLT, 2019.
- Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
- A. Shewalkar, D. Nyavanandi, and S. Ludwig, “Performance evaluation of deep neural networks applied to speech recognition: RNN, LSTM and GRU,” J. Artif. Intell. Soft Comput., vol. 9, pp. 235–245, 2019.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” in Proc. ICML, 2012.
- H. Kamper, M. Elsner, A. Jansen, and S. Goldwater, “Unsupervised neural network based feature extraction using weak top-down constraints,” in Proc. ICASSP, 2015.
- J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘Siamese’ time delay neural network,” Int. J. Pattern Rec., vol. 7, no. 4, pp. 669–688, 1993.
- J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in Proc. CVPR, 2014.
- G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, 2010.
- A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.
- E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Proc. SIMBAD, 2015.
- A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020.
- A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. Interspeech, 2021.
- W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451–3460, 2021.
- S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1505–1518, 2022.
- D. R. H. Miller, M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish, “Rapid and accurate spoken term detection,” in Proc. Interspeech, 2007.
- M. Saraclar and R. Sproat, “Lattice-based search for spoken utterance retrieval,” in Proc. HLT-NAACL, 2004.
- P. Yu, K. Chen, C. Ma, and F. Seide, “Vocabulary-independent indexing of spontaneous speech,” IEEE Trans. Speech, Audio Process., vol. 13, no. 5, pp. 635–643, 2005.
- K. Ng and V. Zue, “Subword-based approaches for spoken document retrieval,” Speech Commun., 2000.
- N. Rajput and F. Metze, “Spoken web search,” in Proc. MediaEval Workshop, 2011.
- E. Barnard, M. Davel, C. van Heerden, X. Anguera, G. Gravier, and N. Rajput, “The spoken web search task,” in Proc. MediaEval Workshop, 2012.
- F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes, “The spoken web search task,” in Proc. MediaEval Workshop, 2013.
- X. Anguera, L. J. Rodriguez-Fuentes, I. Szöke, A. Buzo, and F. Metze, “Query by example search on speech at MediaEval 2014,” in Proc. MediaEval Workshop, 2014.
- I. Szöke, F. Metze, L. J. Rodriguez-Fuentes, J. Proenca, A. Buzo, M. Lojka, X. Anguera, and X. Xiong, “Query by example search on speech at MediaEval 2015,” in Proc. MediaEval Workshop, 2015.
- G. Mantena and X. Anguera, “Speed improvements to information retrieval-based dynamic time warping using hierarchical K-Means clustering,” in Proc. ICASSP, 2013.
- Y. Zhang and J. Glass, “A piecewise aggregate approximation lower-bound estimate for posteriorgram-based dynamic time warping,” in Proc. Interspeech, 2011.
- H. Kamper, A. Anastassiou, and K. Livescu, “Semantic query-by-example speech search using visual grounding,” in Proc. ICASSP, 2019.
- H. Kamper, A. Jansen, and S. Goldwater, “A segmental framework for fully-unsupervised large-vocabulary speech recognition,” Comput. Speech Lang., vol. 46, pp. 154–174, 2017.
- ——, “Unsupervised word segmentation and lexicon discovery using acoustic word embeddings,” IEEE Trans. Audio, Speech, Language Process., vol. 24, no. 4, pp. 669–679, 2016.
- L. van Staden, “Improving unsupervised acoustic word embeddings using segment- and frame-level information,” Thesis, Stellenbosch University, Stellenbosch, 2021.
- S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2009.
- M. A. Carlin, S. Thomas, A. Jansen, and H. Hermansky, “Rapid evaluation of speech representations for spoken term discovery,” in Proc. Interspeech, 2011.
- R. Algayres, M. S. Zaiem, B. Sagot, and E. Dupoux, “Evaluating the reliability of acoustic speech embeddings,” in Proc. Interspeech, 2020.
- B. M. Abdullah, M. Mosbach, I. Zaitova, B. Möbius, and D. Klakow, “Do acoustic word embeddings capture phonological similarity? An empirical study,” in Proc. Interspeech, 2021.
- F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proc. CVPR, 2015.
- A. Medela and A. Picon, “Constellation Loss: Improving the efficiency of deep metric learning loss functions for optimal embedding,” J. Pathol. Informat., vol. 11, no. 1, p. 38, 2020.
- P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in Proc. NeurIPS, 2019.
- T. Schultz, T. Vu, and T. Schlippe, “GlobalPhone: A multilingual text & speech database in 20 languages,” in Proc. ICASSP, 2013.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
- Y. Matusevych, H. Kamper, and S. Goldwater, “Analyzing autoencoder-based acoustic word embeddings,” in BAICS Workshop ICLR, 2020.
- L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, no. 86, pp. 2579–2605, 2008.
- R. Bedyakin and N. Mikhaylovskiy, “Low-resource spoken language identification using self-attentive pooling and deep 1D time-channel separable convolutions,” arXiv preprint arXiv:2106.00052, 2021.
- I. Orife, J. Kreutzer, B. Sibanda, D. Whitenack, K. Siminyu, L. Martinus, J. T. Ali, J. Abbott, V. Marivate, S. Kabongo, M. Meressa, E. Murhabazi, O. Ahia, E. van Biljon, A. Ramkilowan, A. Akinfaderin, A. Öktem, W. Akin, G. Kioko, K. Degila, H. Kamper, B. Dossou, C. Emezue, K. Ogueji, and A. Bashir, “Masakhane – Machine translation for Africa,” in Proc. ICLR, 2020.
- National Geographic Society, “Family of language,” 2020. [Online]. Available: http://www.nationalgeographic.org/encyclopedia/family-language/
- E. Barnard, M. Davel, C. van Heerden, F. de Wet, and J. Badenhorst, “The NCHLT speech corpus of the South African languages,” in Proc. SLTU, 2014.
- T. Probert and M. de Vos, “Word recognition strategies amongst isiXhosa/English bilingual learners: The interaction of orthography and language of learning and teaching,” Reading & Writing, vol. 7, no. 1, p. 10, 2016.
- F. Chalk, “Hate Radio in Rwanda,” in The Path of a Genocide: The Rwanda Crisis from Uganda to Zaire, 1999.
- K. Somerville, “Violence, hate speech and inflammatory broadcasting in Kenya: The problems of definition and identification,” Ecquid Novi: African Journalism Studies, vol. 32, no. 1, pp. 82–101, 2011.
- E. I. Odera, “Radio and hate speech: A comparative study of Kenya (2007 PEV) and the 1994 Rwanda genocide,” Ph.D. dissertation, University of Nairobi, 2015.
- G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proc. ICASSP, 2014.
- K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, and B. Kingsbury, “End-to-end ASR-free keyword search from speech,” in Proc. ICASSP, 2017.
- E. T. Mekonnen, A. Brutti, and D. Falavigna, “End-to-end low resource keyword spotting through character recognition and beam-search re-scoring,” in Proc. ICASSP, 2022.
- E. Gauthier, L. Besacier, S. Voisin, M. Melese, and U. P. Elingui, “Collecting resources in sub-Saharan African languages for automatic speech recognition: A case study of Wolof,” in Proc. LREC, 2016.
- R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proc. LREC, 2020.
- A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. Interspeech, 2022.
- A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006.
- M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,” in Proc. Interspeech, 2017.
- R. Sanabria, H. Tang, and S. Goldwater, “Analyzing acoustic word embeddings from pre-trained self-supervised speech models,” in Proc. ICASSP, 2023.
- M. L. Seltzer, D. Yu, and Y. Wang, “An investigation of deep neural networks for noise robust speech recognition,” in Proc. ICASSP, 2013.
- T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert, and G. Synnaeve, “Rethinking evaluation in ASR: Are our models robust enough?” arXiv preprint arXiv:2010.11745, 2021.
- W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli, “Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training,” in Proc. Interspeech, 2021.
- L. McInnes, J. Healy, N. Saul, and L. Großberger, “UMAP: Uniform manifold approximation and projection,” Journal of Open Source Software, vol. 3, no. 29, p. 861, 2018.
- T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proc. ICLR, 2013.
- D. Harwath and J. Glass, “Deep multimodal semantic embeddings for speech and images,” in Proc. ASRU, 2015.
- H. Kamper, S. Settle, G. Shakhnarovich, and K. Livescu, “Visually grounded learning of keyword prediction from untranscribed speech,” in Proc. Interspeech, 2017.
- B. M. Abdullah, B. Möbius, and D. Klakow, “Integrating form and meaning: A multi-task learning model for acoustic word embeddings,” in Proc. Interspeech, 2022.
- G. Chen and Y. Cao, “A reality check and a practical baseline for semantic speech embeddings,” in ICASSP, 2023.
- H. Kamper, G. Shakhnarovich, and K. Livescu, “Semantic speech retrieval with a visually grounded model of untranscribed speech,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 27, no. 1, pp. 89–98, 2019.
- C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using Amazon’s mechanical turk,” in Proc. NAACL HLT, 2010.
- F. Hill, R. Reichart, and A. Korhonen, “SimLex-999: Evaluating semantic models with (genuine) similarity estimation,” Comput. Linguist., vol. 41, no. 4, pp. 665–695, 2015.
- L. van Staden and H. Kamper, “A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings,” in Proc. SLT, 2021.
- T. S. Fuchs, Y. Hoshen, and J. Keshet, “Unsupervised word segmentation using K nearest neighbors,” in Proc. Interspeech, 2022.
- S. Cuervo, M. Grabias, J. Chorowski, G. Ciesielski, A. Łańcucki, P. Rychlikowski, and R. Marxer, “Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words,” in Proc. ICASSP, 2022.
- H. Kamper, “Word segmentation on discovered phone units with dynamic programming and self-supervised scoring,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 684–694, 2023.
Christiaan Jacobs