PersianMind: A Cross-Lingual Persian-English Large Language Model (2401.06466v1)
Abstract: Large language models (LLMs) demonstrate remarkable proficiency in various linguistic tasks and have extensive knowledge across a wide range of domains. Although they perform best in English, their ability in other languages is notable too. In contrast, open-source models, such as LLaMa, are primarily trained on English datasets, resulting in poor performance in non-English languages. In this paper, we introduce PersianMind, an open-source bilingual LLM that demonstrates comparable performance to the closed-source GPT-3.5-turbo in the Persian language. By expanding LLaMa2's vocabulary with 10,000 Persian tokens and training it on a dataset comprising nearly 2 billion Persian tokens, we show that our approach preserves the model's English knowledge and leverages transfer learning to effectively carry task knowledge from one language to the other.
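The vocabulary-expansion step described in the abstract can be sketched with the Hugging Face transformers API. The snippet below is a minimal illustration under assumptions, not the authors' released pipeline: the checkpoint name and the three sample tokens are placeholders, and in practice the new entries would come from a subword tokenizer trained on a large Persian corpus before continued pretraining on the roughly 2 billion Persian tokens.

```python
# Minimal sketch of vocabulary expansion for a LLaMA-2-style model.
# NOT the authors' exact pipeline; the checkpoint name and token list
# below are placeholders for illustration only.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# In practice the new tokens would come from a tokenizer trained on a
# Persian corpus (e.g., SentencePiece); this toy list stands in for the
# ~10,000 Persian tokens mentioned in the abstract.
new_persian_tokens = ["سلام", "کتاب", "دانشگاه"]  # placeholder tokens
num_added = tokenizer.add_tokens(new_persian_tokens)

# Grow the input/output embedding matrices so the new token ids have rows;
# these rows are then learned during continued pretraining on Persian text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```

After this step, the expanded model is trained further on Persian data so that the newly added embeddings acquire useful representations while the original English weights are largely retained.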
Authors: Pedram Rostami, Ali Salemi, Mohammad Javad Dousti