Cross-Lingual Transfer Robustness to Lower-Resource Languages on Adversarial Datasets (2403.20056v1)

Published 29 Mar 2024 in cs.CL

Abstract: Multilingual language models (MLLMs) exhibit robust cross-lingual transfer capabilities, i.e., the ability to leverage information acquired in a source language and apply it to a target language. These capabilities find practical applications in well-established NLP tasks such as Named Entity Recognition (NER). This study investigates the effectiveness of a source language when applied to a target language, particularly when the input test set is perturbed. We evaluate 13 pairs of languages, each comprising one high-resource language (HRL) and one low-resource language (LRL) with a geographic, genetic, or borrowing relationship. We evaluate two well-known MLLMs, mBERT and XLM-R, on these pairs in native-LRL and cross-lingual transfer settings, on two tasks, under a set of different perturbations. Our findings indicate that NER cross-lingual transfer depends largely on the overlap of entity chunks: if a source and target language have more entities in common, transfer is stronger. Models using cross-lingual transfer also appear somewhat more robust to certain perturbations of the input, perhaps indicating an ability to leverage stronger representations derived from the HRL. Our research provides valuable insights into cross-lingual transfer and its implications for NLP applications, and underscores the need to consider linguistic nuances and potential limitations when employing MLLMs across distinct languages.
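
The two technical ideas in the abstract, entity-chunk overlap between source and target data and perturbation of the test input, are easy to make concrete. Below is a minimal sketch (not the authors' implementation; the file format, paths, and function names are illustrative assumptions) that computes the fraction of target-language entity surface forms also present in the source data from CoNLL-style BIO-tagged files, plus a toy character-swap perturbation as one example of an input perturbation.

```python
# Minimal sketch of two quantities discussed in the abstract; not the
# authors' code. Assumes CoNLL-style "token<TAB>tag" files with blank
# lines between sentences; all names here are illustrative.
import random


def read_conll(path):
    """Read a CoNLL-style file into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line ends a sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            parts = line.split("\t")
            tokens.append(parts[0])
            tags.append(parts[-1])
    if tokens:
        sentences.append((tokens, tags))
    return sentences


def entity_chunks(sentences):
    """Collect the surface forms of entity chunks from BIO tags."""
    chunks = set()
    for tokens, tags in sentences:
        current = []
        for tok, tag in zip(tokens, tags):
            if tag.startswith("B-"):  # a new chunk begins
                if current:
                    chunks.add(" ".join(current))
                current = [tok]
            elif tag.startswith("I-") and current:  # the chunk continues
                current.append(tok)
            else:  # O tag: flush any open chunk
                if current:
                    chunks.add(" ".join(current))
                current = []
        if current:
            chunks.add(" ".join(current))
    return chunks


def chunk_overlap(src_sents, tgt_sents):
    """Fraction of target-language entity chunks also seen in the source data."""
    src, tgt = entity_chunks(src_sents), entity_chunks(tgt_sents)
    return len(src & tgt) / max(len(tgt), 1)


def perturb_tokens(tokens, rate=0.15, seed=0):
    """Toy adjacent-character-swap perturbation (one of many possible perturbations)."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if len(tok) > 3 and rng.random() < rate:
            i = rng.randrange(len(tok) - 1)
            tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        out.append(tok)
    return out


# Hypothetical usage:
# src = read_conll("hrl_train.conll")  # e.g., high-resource training data
# tgt = read_conll("lrl_test.conll")   # e.g., low-resource test data
# print(f"entity-chunk overlap: {chunk_overlap(src, tgt):.2%}")
# perturbed = [(perturb_tokens(toks), tags) for toks, tags in tgt]
```

Under the paper's finding, a higher overlap score between an HRL training set and an LRL test set should predict stronger cross-lingual NER transfer.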

Authors (2)
  1. Shadi Manafi
  2. Nikhil Krishnaswamy