
AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic (2403.09017v3)

Published 14 Mar 2024 in cs.CL

Abstract: The swift progress and widespread adoption of AI systems highlight a pressing need to understand both the capabilities and potential risks of AI. Given the linguistic complexity, cultural richness, and underrepresented status of Arabic in AI research, there is an urgent need to focus on LLMs' performance and safety on Arabic-related tasks. Despite some progress in their development, there is a lack of comprehensive trustworthiness evaluation benchmarks, which presents a major challenge in accurately assessing and improving the safety of LLMs when prompted in Arabic. In this paper, we introduce AraTrust, the first comprehensive trustworthiness benchmark for LLMs in Arabic. AraTrust comprises 522 human-written multiple-choice questions addressing diverse dimensions related to truthfulness, ethics, safety, physical health, mental health, unfairness, illegal activities, privacy, and offensive language. We evaluated a set of LLMs against our benchmark to assess their trustworthiness. GPT-4 was the most trustworthy LLM, while open-source models, particularly AceGPT 7B and Jais 13B, struggled to achieve a score of 60% on our benchmark.
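Since AraTrust is scored as multiple-choice accuracy over 522 questions grouped by trust dimension (truthfulness, privacy, safety, and so on), the evaluation loop is straightforward. Below is a minimal sketch of such a scorer; the item schema and the ask_model() stub are illustrative assumptions, not the authors' released data format or code.

```python
# Sketch: accuracy scoring for an AraTrust-style multiple-choice benchmark.
# The item schema and ask_model() are hypothetical placeholders, not the
# authors' release; swap in the real dataset loader and a real LLM call.
from collections import defaultdict

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder for an LLM call that returns the chosen option letter.
    Always answers 'A' here so the script runs without an API key."""
    return "A"

def evaluate(items):
    """Return overall accuracy and per-category accuracy for MCQ items."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = ask_model(item["question"], item["choices"])
        total[item["category"]] += 1
        if pred == item["answer"]:
            correct[item["category"]] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_category = {c: correct[c] / total[c] for c in total}
    return overall, per_category

if __name__ == "__main__":
    # Hypothetical records: one question per row with a gold option letter.
    items = [
        {"category": "privacy", "question": "...", "choices": ["...", "...", "..."], "answer": "A"},
        {"category": "safety", "question": "...", "choices": ["...", "...", "..."], "answer": "B"},
    ]
    overall, per_category = evaluate(items)
    print(f"overall accuracy: {overall:.1%}")
    for cat, acc in sorted(per_category.items()):
        print(f"  {cat}: {acc:.1%}")
```

Under this framing, the paper's headline result reads directly as accuracy: GPT-4 scores highest overall, while AceGPT 7B and Jais 13B stay below 60%.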

Authors (6)
  1. Emad A. Alghamdi
  2. Reem I. Masoud
  3. Deema Alnuhait
  4. Afnan Y. Alomairi
  5. Ahmed Ashraf
  6. Mohamed Zaytoon