PWESuite: Phonetic Word Embeddings and Tasks They Facilitate (2304.02541v4)
Abstract: Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.
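As a rough illustration of how articulatory features can yield a graded notion of sound similarity, the sketch below computes a feature-weighted edit distance between two phoneme sequences. It is not one of the three methods the abstract refers to. The five-feature table, the phoneme inventory, and the function names are hypothetical and chosen only for this example. A real system would draw on a full articulatory feature resource.

```python
from typing import Dict, List, Tuple

# Hypothetical toy feature table (an assumption for illustration only):
# (voiced, labial, coronal, dorsal, vowel)
FEATURES: Dict[str, Tuple[int, ...]] = {
    "p": (0, 1, 0, 0, 0),
    "b": (1, 1, 0, 0, 0),
    "t": (0, 0, 1, 0, 0),
    "d": (1, 0, 1, 0, 0),
    "k": (0, 0, 0, 1, 0),
    "æ": (1, 0, 0, 0, 1),
}


def substitution_cost(a: str, b: str) -> float:
    """Fraction of articulatory features on which two phonemes differ."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def feature_edit_distance(w1: List[str], w2: List[str]) -> float:
    """Levenshtein distance with feature-based substitution costs.

    Insertions and deletions cost 1.0; substitutions cost the fraction
    of differing features, so similar sounds are cheap to swap.
    """
    n, m = len(w1), len(w2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,  # delete a phoneme from w1
                dp[i][j - 1] + 1.0,  # insert a phoneme from w2
                dp[i - 1][j - 1] + substitution_cost(w1[i - 1], w2[j - 1]),
            )
    return dp[n][m]


pat, bat, kat = ["p", "æ", "t"], ["b", "æ", "t"], ["k", "æ", "t"]
print(feature_edit_distance(pat, bat))  # /p/ and /b/ differ only in voicing
print(feature_edit_distance(pat, kat))  # /p/ and /k/ differ in place of articulation
```

Under this toy table, "pat" comes out closer to "bat" than to "kat", because /p/ and /b/ share more features than /p/ and /k/. A plain character-level edit distance would treat both substitutions as equally costly, which is the kind of phonetic information the abstract says ordinary embeddings overlook.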