Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts (2407.02320v1)

Published 2 Jul 2024 in cs.CL and cs.AI

Abstract: Decoder-only LLMs excel in high-resource languages across various tasks through few-shot or even zero-shot in-context learning (ICL). However, their performance often does not transfer well to low-resource languages, especially those written in non-Latin scripts. Inspired by recent work that leverages transliteration in encoder-only models, we investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts. To this end, we propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. We apply these methods to several representative LLMs of different sizes on various tasks including text classification and sequential labeling. Our findings show that the effectiveness of transliteration varies by task type and model size. For instance, all models benefit from transliterations for sequential labeling (with increases of up to 25%).
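
The three prompt templates are easiest to see as a small prompt-building helper. Below is a minimal sketch, assuming a generic zero-shot topic-classification instruction and using `unidecode` as a crude stand-in romanizer (a real setup would use a dedicated tool such as uroman); the prompt wording, labels, and example sentence are illustrative, not the paper's exact setup.

```python
# Sketch of the three template variants: (1) original script, (2) Latin script, (3) both.
from unidecode import unidecode  # third-party: pip install Unidecode


def build_prompt(text: str, template: str) -> str:
    """Build a zero-shot classification prompt in one of the three template variants."""
    romanized = unidecode(text)  # rough Latin-script approximation of the input
    if template == "original":
        body = f"Text: {text}"
    elif template == "latin":
        body = f"Text (romanized): {romanized}"
    elif template == "both":
        body = f"Text: {text}\nText (romanized): {romanized}"
    else:
        raise ValueError(f"unknown template: {template}")
    return "Classify the topic of the following text.\n" + body + "\nTopic:"


if __name__ == "__main__":
    sample = "ይህ ስለ ሳይንስ የሚናገር ዓረፍተ ነገር ነው።"  # Amharic (Ge'ez script) example sentence
    for variant in ("original", "latin", "both"):
        print(f"--- {variant} ---")
        print(build_prompt(sample, variant))
```

The same three views carry over to sequence-labeling prompts, where the romanized text is shown alongside or in place of the original-script tokens.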

Authors (4)
  1. Chunlan Ma (20 papers)
  2. Yihong Liu (25 papers)
  3. Haotian Ye (39 papers)
  4. Hinrich Schütze (250 papers)
Citations (1)