
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models (2305.13684v2)

Published 23 May 2023 in cs.CL

Abstract: Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.
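To make the core idea concrete, below is a minimal sketch of an mPLM-Sim-style similarity computation, not the authors' reference implementation. It assumes a HuggingFace mPLM (XLM-R here as an illustrative choice), mean pooling over one hidden layer, and a tiny index-aligned multi-parallel corpus; the specific model, layer, and pooling are assumptions, since the paper reports that results vary across mPLMs and layers.

```python
# Hedged sketch: estimate the similarity of two languages as the average
# cosine similarity between mPLM embeddings of index-aligned translations
# of the same sentences. Model, layer, and pooling choices are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # any mPLM exposing hidden states would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def embed(sentences, layer=8):
    """Mean-pool token embeddings from one hidden layer for each sentence."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer]   # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)        # exclude padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)       # (batch, dim)

def mplm_sim(parallel, lang_a, lang_b, layer=8):
    """Average cosine similarity over index-aligned parallel sentence pairs."""
    ea = embed(parallel[lang_a], layer)
    eb = embed(parallel[lang_b], layer)
    return torch.nn.functional.cosine_similarity(ea, eb, dim=-1).mean().item()

# Toy multi-parallel corpus; in practice a massively parallel corpus
# (e.g., a multi-parallel Bible corpus) plays this role.
parallel = {
    "eng": ["The sky is blue.", "Children like to play."],
    "deu": ["Der Himmel ist blau.", "Kinder spielen gern."],
}
print(mplm_sim(parallel, "eng", "deu"))
```

Computing this score for all language pairs yields a similarity matrix from which, for a given target language, the most similar available source language can be picked for zero-shot transfer.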

Authors (5)
  1. Peiqin Lin (15 papers)
  2. Chengzhi Hu (5 papers)
  3. Zheyu Zhang (23 papers)
  4. André F. T. Martins (113 papers)
  5. Hinrich Schütze (250 papers)
Citations (1)