When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages (2305.14012v2)
Abstract: Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good-quality static or contextual embeddings, which require large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related high-resource languages (HRLs), resulting in severely imbalanced data settings for BLI. We first show that state-of-the-art BLI methods in the literature exhibit near-zero performance for severely data-imbalanced language pairs, indicating that these settings require more robust techniques. We then present a new method for unsupervised BLI between a related LRL and HRL that only requires inference on a masked language model of the HRL, and demonstrate its effectiveness on the truly low-resource languages Bhojpuri and Magahi (with <5M monolingual tokens each), against Hindi. We further present experiments on (mid-resource) Marathi and Nepali to compare how the approaches perform across resource ranges, and release our resulting lexicons for five low-resource Indic languages: Bhojpuri, Magahi, Awadhi, Braj, and Maithili, against Hindi.
- Niyati Bafna
- Cristina España-Bonet
- Josef van Genabith
- Benoît Sagot
- Rachel Bawden