
Multilingual Substitution-based Word Sense Induction (2405.11086v1)

Published 17 May 2024 in cs.CL

Abstract: Word Sense Induction (WSI) is the task of discovering the senses of an ambiguous word by grouping usages of this word into clusters corresponding to these senses. Many approaches have been proposed to solve WSI in English and a few other languages, but these approaches are not easily adaptable to new languages. We present multilingual substitution-based WSI methods that support any of the 100 languages covered by the underlying multilingual LLM with minimal to no adaptation required. Despite their multilingual capabilities, our methods perform on par with existing monolingual approaches on popular English WSI datasets. At the same time, they will be most useful for lower-resourced languages, which lack the lexical resources available for English and thus have a higher demand for unsupervised methods like WSI.
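
To make the idea of substitution-based WSI concrete, here is a minimal sketch, not the paper's actual method. It assumes each usage of the target word already comes with a list of lexical substitutes (in the paper these would be generated by a multilingual LLM; here they are hard-coded toy data). Usages are then grouped by how much their substitute sets overlap. The function names, the Jaccard similarity, the average-link merging rule, and the threshold of 0.2 are all illustrative assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two substitute lists, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def induce_senses(substitutes, threshold=0.2):
    """Average-link agglomerative clustering of usages (illustrative only).

    substitutes: one list of substitute words per usage of the target word.
    threshold:   hypothetical cut-off; merging stops once the best pair of
                 clusters is less similar than this.
    Returns one cluster label per usage.
    """
    clusters = [[i] for i in range(len(substitutes))]

    def avg_sim(c1, c2):
        total = sum(jaccard(substitutes[i], substitutes[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters (x < y by construction).
        (x, y), best = max(
            (((x, y), avg_sim(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(substitutes)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy substitutes for four usages of the English word "bank" (made-up data).
toy_substitutes = [
    ["lender", "institution", "company"],   # "the bank approved the loan"
    ["institution", "lender", "firm"],      # "she works at a bank"
    ["shore", "riverside", "edge"],         # "we sat on the river bank"
    ["shore", "edge", "side"],              # "the bank of the stream"
]
labels = induce_senses(toy_substitutes)
print(labels)
```

Because the clustering only sees substitute words, nothing in it depends on the language of the input, which is the property the abstract emphasizes; the language-specific work is delegated to whatever model produces the substitutes.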
