
ParsBERT: Transformer-based Model for Persian Language Understanding (2005.12515v2)

Published 26 May 2020 in cs.CL

Abstract: The surge of pre-trained language models has begun a new era in the field of NLP by allowing us to build powerful language models. Among these models, Transformer-based models such as BERT have become increasingly popular due to their state-of-the-art performance. However, these models are usually focused on English, leaving other languages to multilingual models with limited resources. This paper proposes a monolingual BERT for the Persian language (ParsBERT), which shows its state-of-the-art performance compared to other architectures and multilingual models. Also, since the amount of data available for NLP tasks in Persian is very restricted, a massive dataset for different NLP tasks as well as pre-training the model is composed. ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones, and improves the state-of-the-art performance by outperforming both multilingual BERT and other prior works in Sentiment Analysis, Text Classification and Named Entity Recognition tasks.

ParsBERT: Transformer-based Model for Persian Language Understanding

The paper under review presents "ParsBERT," a monolingual BERT model tailored explicitly for the Persian language, addressing a critical gap in NLP tools available for non-English languages. The authors, Farahani et al., highlight the limitations of existing multilingual models in effectively handling the Persian language, motivating the development of a dedicated transformer-based framework to achieve superior performance in various NLP tasks.

Model Overview

ParsBERT is derived from the BERT architecture, leveraging a multi-layer bidirectional Transformer encoder. The configuration closely follows the BERT_BASE setup: 12 hidden layers, 12 attention heads, and roughly 110 million parameters. Pre-training uses the two standard BERT objectives, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), which drive the acquisition of the contextual representations needed for downstream text understanding.
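
The sketch below illustrates how these two objectives combine during pre-training. It is a minimal sketch, assuming the HuggingFace `transformers` library; the checkpoint name refers to the publicly released ParsBERT weights and is an assumption, not something stated in this summary.

```python
# Minimal sketch of the two BERT pre-training objectives used by ParsBERT (MLM + NSP).
# Assumes the HuggingFace `transformers` library; the checkpoint name below is an
# assumption (the publicly released ParsBERT weights), not taken from the paper summary.
import torch
from transformers import AutoTokenizer, BertForPreTraining

model_name = "HooshvareLab/bert-base-parsbert-uncased"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForPreTraining.from_pretrained(model_name)

# Two consecutive Persian sentences; NSP predicts whether sentence B follows sentence A.
sentence_a = "این یک جمله نمونه است."
sentence_b = "جمله بعدی به آن مربوط است."
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

# MLM: mask ~15% of the non-special tokens and predict the originals.
labels = inputs["input_ids"].clone()
special = torch.tensor(
    tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)
).bool()
mask = (torch.rand(labels.shape[1]) < 0.15) & ~special
inputs["input_ids"][0, mask] = tokenizer.mask_token_id
labels[0, ~mask] = -100  # only masked positions contribute to the MLM loss

outputs = model(
    **inputs,
    labels=labels,                          # MLM targets
    next_sentence_label=torch.tensor([0]),  # 0 = "sentence B follows sentence A"
)
print(outputs.loss)  # combined MLM + NSP loss
```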

Corpus and Pre-processing

A significant contribution of this work is the creation of a comprehensive Persian textual corpus, approximately 14GB in size, curated from diverse sources. This corpus addresses the shortcomings of existing datasets, which are either limited in scope or of inadequate linguistic quality. Rigorous pre-processing and normalization are applied to ensure data quality, and WordPiece tokenization yields a vocabulary of roughly 100K tokens. A notable aspect of the dataset preparation is the emphasis on preserving complete, true sentences so that pre-training (and the NSP objective in particular) operates on semantically coherent text.
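
As an illustration of the WordPiece step, the short sketch below tokenizes a Persian sentence with the released ParsBERT tokenizer. The HuggingFace checkpoint name is an assumption; the printed vocabulary size should be on the order of the 100K figure reported above.

```python
# Illustrative WordPiece tokenization of Persian text with the ParsBERT tokenizer.
# Assumes the HuggingFace `transformers` library; the checkpoint name is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

text = "پردازش زبان فارسی با پارس‌برت"  # "Persian language processing with ParsBERT"
print(tokenizer.tokenize(text))  # subword pieces; rarer words split into "##"-prefixed units
print(tokenizer.vocab_size)      # should be roughly the reported 100K tokens
```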

Evaluation and Performance

ParsBERT's performance is evaluated on three downstream NLP tasks: Sentiment Analysis, Text Classification, and Named Entity Recognition (NER). The results show consistent improvements over previous models (a brief fine-tuning sketch follows the task list):

  1. Sentiment Analysis: ParsBERT outperforms multilingual BERT and other custom architectures across three datasets, including Digikala, SnappFood, and DeepSentiPers, with notable gains in F1 scores.
  2. Text Classification: Utilizing datasets collected from Digikala Online Magazine and Persian news sources, ParsBERT achieves superior accuracy and F1 scores compared to multilingual BERT, underscoring its effectiveness in capturing nuanced linguistic patterns.
  3. Named Entity Recognition: ParsBERT sets new benchmarks for NER tasks using PEYMA and ARMAN datasets, surpassing existing models such as Beheshti-NER by substantial margins.
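
The following is a minimal fine-tuning sketch for one of these tasks (binary sentiment classification). It assumes HuggingFace `transformers` and `torch`; the checkpoint name and the two toy examples (standing in for a dataset such as SnappFood) are illustrative assumptions rather than the authors' exact training setup.

```python
# Minimal fine-tuning sketch: binary sentiment classification with ParsBERT.
# The checkpoint name and toy data are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HooshvareLab/bert-base-parsbert-uncased"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labeled examples (1 = positive, 0 = negative) standing in for e.g. SnappFood reviews.
texts = ["غذا عالی بود", "کیفیت بسیار بد بود"]
labels = torch.tensor([1, 0])
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for step in range(3):  # a few illustrative gradient steps
    optimizer.zero_grad()
    out = model(**enc, labels=labels)  # cross-entropy over the pooled [CLS] representation
    out.loss.backward()
    optimizer.step()
    print(f"step {step}: loss {out.loss.item():.4f}")
```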

Implications and Future Directions

The introduction of ParsBERT marks a pivotal step in enhancing NLP capabilities for the Persian language. Its success underscores the value of language-specific models, which can substantially outperform their multilingual counterparts when adequately resourced. The work argues for continued investment in building and refining language models tailored to low-resource languages, supporting more inclusive and reliable NLP technologies globally.

Future work could extend ParsBERT to emerging NLP tasks, expand the corpus to cover dialectal variation, and improve computational efficiency. Integrating ParsBERT into multi-modal applications could further broaden its utility, offering a robust foundation for interdisciplinary research across AI and computational linguistics.

Authors (4)
  1. Mehrdad Farahani (6 papers)
  2. Mohammad Gharachorloo (2 papers)
  3. Marzieh Farahani (2 papers)
  4. Mohammad Manthouri (17 papers)
Citations (180)