Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models (2306.16322v1)
Abstract: Large language models (LLMs), including ChatGPT, a chat-based model built on top of LLMs such as GPT-3.5 and GPT-4, have demonstrated impressive performance on various downstream tasks without requiring fine-tuning. Although languages other than English account for a smaller share of the training data, these models also exhibit remarkable capabilities in those languages. In this study, we assess the performance of GPT-3.5 and GPT-4 on seven distinct Arabic NLP tasks: sentiment analysis, translation, transliteration, paraphrasing, part-of-speech tagging, summarization, and diacritization. Our findings reveal that GPT-4 outperforms GPT-3.5 on five of the seven tasks. Furthermore, we conduct an extensive analysis of the sentiment analysis task, providing insights into how LLMs achieve exceptional results on a challenging dialectal dataset. Additionally, we introduce a new Python interface, https://github.com/ARBML/Taqyim, that makes it easy to evaluate these tasks.
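The evaluations described in the abstract amount to prompting OpenAI chat models on each task and scoring the responses. Below is a minimal sketch of how one task (zero-shot sentiment analysis) could be run through the OpenAI Python client; the prompt wording, model name, and label set are illustrative assumptions, not the exact configuration used in the paper or exposed by the Taqyim interface.

```python
# Minimal sketch: zero-shot Arabic sentiment classification with an OpenAI chat model.
# The prompt, model name, and label set are illustrative; the actual Taqyim pipeline
# (https://github.com/ARBML/Taqyim) wraps this kind of call behind its own interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["positive", "negative"]  # assumed binary label set for the dialectal dataset


def classify_sentiment(text: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the chat model for a one-word sentiment label for an Arabic sentence."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for evaluation
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a sentiment classifier for Arabic text. "
                    f"Answer with exactly one word from: {', '.join(LABELS)}."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()


if __name__ == "__main__":
    print(classify_sentiment("الخدمة كانت ممتازة والموظفين متعاونين"))  # expected: positive
```

Running such a prompt over a labeled test set and comparing the returned labels against the gold annotations gives the per-task scores that the paper reports for GPT-3.5 and GPT-4.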
- Zaid Alyafeai (21 papers)
- Maged S. Alshaibani (2 papers)
- Badr AlKhamissi (24 papers)
- Hamzah Luqman (12 papers)
- Ebrahim Alareqi (3 papers)
- Ali Fadel (5 papers)