Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models (2308.16149v2)

Published 30 Aug 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative LLMs. The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model -- the foundation Jais model, and an instruction-tuned Jais-chat variant -- with the aim of promoting research on Arabic LLMs. Available at https://huggingface.co/inception-mbzuai/jais-13b-chat
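The released checkpoint linked above is hosted on the Hugging Face Hub. Below is a minimal, untested sketch of loading it with the transformers library. It assumes the checkpoint exposes a standard causal-LM interface, that the repository ships custom modeling code (hence `trust_remote_code=True`), and that half precision is sufficient to fit the 13B parameters on the available hardware; the plain Arabic prompt is illustrative only and may not match the chat template Jais-chat was tuned with.

```python
# Minimal sketch (assumptions noted inline): loading the released
# inception-mbzuai/jais-13b-chat checkpoint with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "inception-mbzuai/jais-13b-chat"  # URL given in the abstract

# trust_remote_code is assumed to be required for the model's custom
# decoder-only (GPT-3-style) architecture code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # 13B parameters; half precision to reduce memory
    device_map="auto",
    trust_remote_code=True,
)

# Illustrative Arabic prompt: "What is the capital of the United Arab Emirates?"
prompt = "ما هي عاصمة دولة الإمارات العربية المتحدة؟"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```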
