Benchmarking LLMs for Persian: A Preliminary Study Focusing on ChatGPT
Overview
The paper "Benchmarking LLMs for Persian: A Preliminary Study Focusing on ChatGPT" provides a detailed evaluation of LLMs specifically in the context of the Persian language. The paper primarily focuses on OpenAI's GPT-3.5-turbo, GPT-4, and an open-source model, OpenChat-3.5. Various benchmarks are established, categorized into classic NLP tasks, reasoning tasks, and knowledge-based tasks. The authors ensure thoroughness by employing both Persian and English prompts in zero-shot, one-shot, and few-shot configurations to obtain a comprehensive understanding of the efficacy of these LLMs.
Key Findings
- Classic NLP Tasks:
  - Sentiment Analysis: GPT-4 achieved a peak macro F1-score of 0.906 in the three-shot setting with English prompts, surpassing the fine-tuned mt5-base model (F1 0.891); macro F1 and BLEU computation are sketched after this list. However, GPT-3.5's performance plateaued as more demonstrations were added.
  - Emotion Recognition: GPT-4's highest F1-score was 0.621, below the fine-tuned ParsBERT model (F1 0.699). The GPT models showed only modest performance, indicating room for improvement in this domain.
  - Named Entity Recognition (NER): GPT-4 achieved a top F1-score of 0.712, compared to the SOTA F1 of 0.988, revealing challenges in recognizing named entities accurately.
  - Machine Translation (MT): GPT-4 performed best on English-to-Persian translation, with a BLEU score of 8.7 in the three-shot setting with English prompts, whereas on Persian-to-English translation the SOTA system (BLEU 11.7) outperformed the LLMs.
  - Reading Comprehension: GPT-4 achieved an F1-score of 0.687, slightly below the SOTA score of 0.691. The models improved markedly with few-shot prompts compared to the zero-shot setting.
- Reasoning Tasks:
  - Textual Entailment: On the ParsiNLU dataset, GPT-4 achieved an F1-score of 0.636 in the three-shot English prompt setting, against a SOTA of 0.690. On the ConjNLI dataset, GPT-4 attained an F1-score of 0.512, just short of the SOTA (0.524).
  - Multiple-choice QA (Math & Logic): GPT-4 displayed strong reasoning capabilities with an accuracy of 0.725 in the three-shot English setting, far outperforming the SOTA (0.395).
  - Elementary School QA: GPT-4 achieved an accuracy of 0.740, indicating proficiency in simpler reasoning tasks.
  - Math Problems: GPT-4 led with an accuracy of 0.564 in the three-shot English prompt setting, indicating robust mathematical reasoning capabilities.
- Knowledge-based Tasks:
  - Literature Knowledge: GPT-4 reached an accuracy of only 0.485, and GPT-3.5 fared worse (0.310), highlighting the models' limitations in domain-specific knowledge.
  - Common Knowledge: GPT-4's accuracy of 0.635 outperformed both GPT-3.5 and the SOTA models, showcasing its general-knowledge proficiency.
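The scores above are macro F1 for the classification-style tasks, BLEU for translation, and accuracy for the QA tasks. As a hedged reference point (the paper's evaluation code is not reproduced here), these metrics are conventionally computed along the following lines, assuming scikit-learn and sacrebleu:

```python
# Conventional computation of the metrics cited above.
# Toy data throughout; real evaluation runs over full test sets.
from sklearn.metrics import f1_score
import sacrebleu

# Macro F1: unweighted mean of per-class F1, so minority classes
# (e.g. rare emotions) count as much as frequent ones.
gold = ["positive", "negative", "neutral", "negative"]
pred = ["positive", "negative", "negative", "negative"]
print(f1_score(gold, pred, average="macro"))  # 0.60 on this toy data

# BLEU via sacrebleu's corpus-level API: a hypothesis list and a
# list of reference streams aligned with it.
hyps = ["the food at the restaurant was great"]
refs = [["the food at the restaurant was excellent"]]
print(sacrebleu.corpus_bleu(hyps, refs).score)

# Accuracy for the multiple-choice QA tasks is just the fraction of
# exact matches between predicted and gold options.
predictions, answers = ["b", "c", "a"], ["b", "c", "d"]
print(sum(p == a for p, a in zip(predictions, answers)) / len(answers))
```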
Implications and Future Directions
The evaluation underscores how the capabilities of LLMs such as GPT-4, GPT-3.5, and OpenChat-3.5 vary across Persian NLP tasks. GPT-4 consistently outperformed GPT-3.5 and OpenChat-3.5, indicating superior generalization and robustness. OpenChat-3.5, despite being an open-source model with a smaller parameter count, remained competitive on several tasks, suggesting that targeted optimization could narrow the gap further.
The paper highlights specific areas where the LLMs underperform, particularly Named Entity Recognition and domain-specific knowledge tasks such as Persian literature. These results point to an opportunity to fine-tune models on more Persian-specific data to bridge these gaps, along the lines of the sketch below.
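A minimal sketch of what that fine-tuning direction could look like, using the Hugging Face Transformers Trainer on ParsBERT; the model id, toy dataset, labels, and hyperparameters are all illustrative assumptions, not the paper's setup.

```python
# Sketch: fine-tuning ParsBERT on Persian-specific labeled data.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "HooshvareLab/bert-fa-base-uncased"  # ParsBERT (assumed id)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

# Tiny stand-in dataset; in practice this would be a Persian benchmark split.
train = Dataset.from_dict({
    "text": ["این فیلم فوق‌العاده بود", "کیفیت محصول بسیار بد بود"],
    "label": [0, 1],  # 0=positive, 1=negative, 2=neutral
})

def tokenize(batch):
    # Pad to a fixed length so the default collator can batch directly.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="parsbert-sentiment",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()
```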
The findings also revealed that performance was generally better with English prompts, even though the test data were in Persian. This insight invites further investigation into the mechanics of multilingual model training and might encourage the development of more sophisticated prompt engineering techniques tailored to low-resource languages.
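To make the finding concrete, here is a hedged side-by-side of the two prompt languages for the same Persian input; the wording is assumed for illustration and is not quoted from the paper.

```python
# Same Persian input, two instruction languages (wording assumed).
persian_input = "این فیلم فوق‌العاده بود"  # "This movie was fantastic."

# English instruction over Persian data -- the variant the paper
# found generally scored higher.
english_prompt = (
    "Classify the sentiment of the following Persian sentence as "
    "positive, negative, or neutral.\n"
    f"Sentence: {persian_input}\nLabel:"
)

# Persian instruction: "Determine the sentiment of the sentence
# below as positive, negative, or neutral."
persian_prompt = (
    "احساس جملهٔ زیر را به صورت مثبت، منفی یا خنثی مشخص کنید.\n"
    f"جمله: {persian_input}\nبرچسب:"
)
```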
Conclusion
This preliminary benchmarking of LLMs for Persian reveals both the promise and limitations of current models like ChatGPT and OpenChat-3.5. The paper provides essential insights into the models' performance across various linguistic tasks and sets the stage for future research to enhance LLM performance for low-resource languages. Continued development and evaluation will likely focus on addressing identified weaknesses and extending these models to be more effective and accurate in non-English contexts.
These efforts will be instrumental in fostering broader, more inclusive AI development, where LLMs can serve diverse linguistic communities with equal proficiency. Future work could also explore integration with other LLM advancements, emphasizing fine-tuning and prompt engineering, specifically catering to low-resource languages like Persian.