On Bilingual Lexicon Induction with Large Language Models (2310.13995v2)
Abstract: Bilingual Lexicon Induction (BLI) is a core task in multilingual NLP that still, to a large extent, relies on calculating cross-lingual word representations. Inspired by the global paradigm shift in NLP towards LLMs, we examine the potential of the latest generation of LLMs for the development of bilingual lexicons. We ask the following research question: Is it possible to prompt and fine-tune multilingual LLMs (mLLMs) for BLI, and how does this approach compare against and complement current BLI approaches? To this end, we systematically study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs, both without any LLM fine-tuning, as well as 3) standard BLI-oriented fine-tuning of smaller LLMs. We experiment with 18 open-source text-to-text mLLMs of different sizes (from 0.3B to 13B parameters) on two standard BLI benchmarks covering a range of typologically diverse languages. Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs. The results reveal that few-shot prompting with in-context examples from nearest neighbours achieves the best performance, establishing new state-of-the-art BLI scores for many language pairs. We also conduct a series of in-depth analyses and ablation studies, providing further insight into BLI with (m)LLMs, along with their limitations.
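To make the best-performing setup concrete, below is a minimal, hypothetical Python sketch of few-shot in-context prompting for BLI: for a query source word, it retrieves the seed translation pairs whose source words are nearest in a source-language embedding space, formats them as in-context examples, and queries an off-the-shelf text-to-text mLLM. The prompt template, the toy embeddings and seed dictionary, and the choice of mT5-small are all illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of few-shot in-context prompting for BLI. Assumptions:
# - toy, randomly initialised source-language word embeddings (a real setup
#   would use pretrained static embeddings such as fastText),
# - a tiny hypothetical seed dictionary of (source, target) pairs,
# - mT5-small as a stand-in text-to-text mLLM, purely for illustration.
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def nearest_seed_pairs(query_vec, seed_words, seed_vecs, seed_dict, k=3):
    """Return the k seed pairs whose source words are most similar to the
    query word in the (unit-normalised) source embedding space."""
    sims = seed_vecs @ query_vec          # cosine similarity on unit vectors
    top = np.argsort(-sims)[:k]
    return [(seed_words[i], seed_dict[seed_words[i]]) for i in top]

def build_prompt(src_word, examples, src_lang="German", tgt_lang="English"):
    """Format retrieved seed pairs as in-context examples, then ask for the
    translation of the query word (template is an illustrative assumption)."""
    lines = [f"The {tgt_lang} word for the {src_lang} word '{s}' is '{t}'."
             for s, t in examples]
    lines.append(f"The {tgt_lang} word for the {src_lang} word '{src_word}' is")
    return " ".join(lines)

# --- toy data, for illustration only ---
seed_dict = {"Hund": "dog", "Katze": "cat", "Haus": "house", "Baum": "tree"}
seed_words = list(seed_dict)
rng = np.random.default_rng(0)
emb = {}
for w in seed_words + ["Vogel"]:          # "Vogel" is the query word
    v = rng.standard_normal(300)
    emb[w] = v / np.linalg.norm(v)
seed_vecs = np.stack([emb[w] for w in seed_words])

examples = nearest_seed_pairs(emb["Vogel"], seed_words, seed_vecs, seed_dict)
prompt = build_prompt("Vogel", examples)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8, num_beams=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Beam search, as sketched here, is one simple way to read out a single-word translation candidate; the zero-shot variant simply omits the retrieved in-context examples from the prompt.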
Authors: Yaoyiran Li, Anna Korhonen, Ivan Vulić