How Lexical is Bilingual Lexicon Induction? (2404.04221v1)
Published 5 Apr 2024 in cs.CL
Abstract: In contemporary machine learning approaches to bilingual lexicon induction (BLI), a model learns a mapping between the embedding spaces of a language pair. Recently, a retrieve-and-rank approach to BLI has achieved state-of-the-art results on the task. However, the problem remains challenging in low-resource settings, due to the paucity of data. The task is further complicated by factors such as lexical variation across languages. We argue that incorporating additional lexical information into the recent retrieve-and-rank approach should improve lexicon induction. We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2% across all language pairs.
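For context, the mapping-based formulation the abstract refers to can be sketched as follows. This is a generic illustration (an orthogonal Procrustes map learned from a seed dictionary, followed by nearest-neighbour retrieval), not the paper's retrieve-and-rank model; all array names, shapes, and the toy data are assumptions made for the example.

```python
# Minimal sketch of mapping-based BLI: learn a linear map between two
# embedding spaces from a seed dictionary, then retrieve translations
# by nearest neighbour. Illustrative only; not the paper's method.
import numpy as np

def fit_procrustes(X_src: np.ndarray, Y_tgt: np.ndarray) -> np.ndarray:
    """Orthogonal W minimising ||X_src @ W - Y_tgt||_F over the seed pairs."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

def retrieve(query_vecs: np.ndarray, tgt_vecs: np.ndarray, W: np.ndarray, k: int = 5):
    """Indices of the k nearest target words (cosine similarity) for each mapped query."""
    mapped = query_vecs @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = mapped @ tgt.T
    return np.argsort(-sims, axis=1)[:, :k]

# Toy usage: random vectors stand in for pretrained source/target embeddings.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 300)), rng.normal(size=(1000, 300))
seed = rng.choice(1000, size=200, replace=False)   # indices of a hypothetical seed dictionary
W = fit_procrustes(X[seed], Y[seed])
candidates = retrieve(X[:10], Y, W, k=5)           # retrieval step; a reranker would score these candidates
```

In a retrieve-and-rank pipeline, the retrieval step above would supply a shortlist of candidate translations, and a separate ranking model would reorder them; the paper's contribution concerns adding lexical information to that ranking stage.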