
PersianMind: A Cross-Lingual Persian-English Large Language Model (2401.06466v1)

Published 12 Jan 2024 in cs.CL and cs.AI

Abstract: LLMs demonstrate remarkable proficiency across a wide range of linguistic tasks and domains. Although they perform best in English, their ability in other languages is also notable. In contrast, open-source models such as LLaMA are trained primarily on English data and consequently perform poorly in non-English languages. In this paper, we introduce PersianMind, an open-source bilingual LLM that achieves performance in Persian comparable to the closed-source GPT-3.5-turbo. By expanding LLaMA-2's vocabulary with 10,000 Persian tokens and training it on a dataset of nearly 2 billion Persian tokens, we show that our approach preserves the model's English knowledge and leverages transfer learning to carry task knowledge from one language to another.
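
To make the approach concrete, here is a minimal sketch of the vocabulary-expansion step using the Hugging Face Transformers API. This is not the authors' code: the checkpoint name and the three sample Persian tokens are illustrative assumptions, and in practice the paper's 10,000 new tokens would be derived from a Persian corpus (e.g., via a subword tokenizer) before being added.

```python
# Minimal sketch (assumed, not from the paper): extend LLaMA-2's tokenizer
# with new Persian tokens and resize the embedding matrix accordingly.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical sample of new Persian subword tokens; the paper adds 10,000,
# presumably learned from a Persian corpus.
new_tokens = ["سلام", "کتاب", "دانشگاه"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input/output embeddings to cover the enlarged vocabulary; the new
# rows start randomly initialized and are learned during continued pretraining.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

After this step, the model would be trained further on the roughly 2 billion Persian tokens mentioned in the abstract, so that the newly initialized embedding rows acquire useful representations while the original English weights are largely preserved.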

Authors (3)
  1. Pedram Rostami (3 papers)
  2. Ali Salemi (1 paper)
  3. Mohammad Javad Dousti (17 papers)
Citations (4)