BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla (2101.00204v4)

Published 1 Jan 2021 in cs.CL

Abstract: In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed `Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.

An Evaluation of BanglaBERT: Advancements in Bangla NLP

This paper introduces BanglaBERT, a BERT-style model pretrained specifically for natural language understanding (NLU) in Bangla. Although Bangla is the sixth most widely spoken language in the world, it remains under-resourced in NLP. The paper addresses this gap by assembling a 27.5 GB pretraining corpus, dubbed 'Bangla2B+', crawled from 110 popular Bangla websites, and by pretraining a model tailored specifically to the language.

Contributions and Methodology

  1. Model Development: The authors present two models: BanglaBERT and a bilingual counterpart, BanglishBERT, which is additionally pretrained on English data to enable zero-shot cross-lingual transfer. BanglaBERT is trained with the ELECTRA framework, using the Replaced Token Detection (RTD) objective for efficient pretraining.
  2. Dataset and Benchmark Creation: They introduce new datasets for Bangla Natural Language Inference (NLI) and Question Answering (QA), and consolidate these with existing datasets into the Bangla Language Understanding Benchmark (BLUB). This marks the first Bangla-specific benchmark to assess model performance across text classification, sequence labeling, and span prediction tasks.
  3. Results: BanglaBERT achieves state-of-the-art results, outperforming both multilingual models such as mBERT and XLM-R and existing monolingual models in the supervised setting, reaching a BLUB score of 77.09. In the zero-shot setting, BanglishBERT shows strong cross-lingual transfer, rivaling XLM-R (large) despite being much smaller (a fine-tuning sketch follows this list).
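
To make the supervised setting in item 3 concrete, the following is a minimal sketch of fine-tuning the released BanglaBERT checkpoint on an NLI-style sentence-pair task with Hugging Face Transformers. The Hub id `csebuetnlp/banglabert`, the placeholder premise/hypothesis strings, and the label are illustrative assumptions, not artifacts from the paper.

```python
# Minimal fine-tuning sketch (assumed checkpoint id; placeholder data).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "csebuetnlp/banglabert"  # assumed Hugging Face Hub id (see the GitHub repo)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=3  # entailment / neutral / contradiction
)

# A single premise-hypothesis pair, encoded as one sequence-pair input.
premise = "..."      # placeholder Bangla premise
hypothesis = "..."   # placeholder Bangla hypothesis
batch = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
labels = torch.tensor([0])  # placeholder gold label

# One optimization step; a real run iterates over the full NLI training set.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```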

Implications

Practical Implementation: BanglaBERT, together with the accompanying datasets, is a critical resource for Bangla NLP applications and for regional language technologies more broadly. The work provides a clear path toward efficient, task-specific Bangla NLP tools for applications such as sentiment analysis and named-entity recognition, as sketched below.
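
As an illustration of such task-specific tooling, here is a minimal sketch of wiring BanglaBERT into a named-entity recognition pipeline with Hugging Face Transformers. The Hub id, the tag set, and the example sentence are illustrative assumptions, and the classification head would need fine-tuning on a labelled Bangla NER corpus before its predictions are meaningful.

```python
# NER sketch on top of BanglaBERT (assumed checkpoint id; illustrative tag set).
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

MODEL_ID = "csebuetnlp/banglabert"  # assumed Hugging Face Hub id
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # illustrative BIO tags

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(TAGS),
    id2label=dict(enumerate(TAGS)),
    label2id={t: i for i, t in enumerate(TAGS)},
)

# After the head is fine-tuned on a labelled Bangla NER corpus, the model
# can be served through a standard token-classification pipeline:
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(ner("বাংলাদেশের রাজধানী ঢাকা।"))  # "The capital of Bangladesh is Dhaka."
```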

Theoretical Insights: The work underscores the benefits of language-specific models over multilingual ones, particularly for low-resource languages. It also shows how a bilingual model can transfer supervision from a high-resource language to a low-resource one via zero-shot cross-lingual transfer (sketched below).
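
A rough sketch of that zero-shot setting, assuming the bilingual checkpoint is published on the Hub as `csebuetnlp/banglishbert` and using hypothetical data iterables: the model is fine-tuned on labelled English examples only and then applied directly to Bangla inputs.

```python
# Zero-shot cross-lingual transfer sketch (assumed checkpoint id; hypothetical data).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "csebuetnlp/banglishbert"  # assumed Hub id for the bilingual model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def finetune_on_english(english_batches):
    """Supervised fine-tuning in which only English labels are ever seen."""
    model.train()
    for texts, labels in english_batches:  # iterable of (list[str], LongTensor)
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        loss = model(**enc, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

@torch.no_grad()
def predict_bangla(bangla_texts):
    """Zero-shot step: the English-fine-tuned model is applied to Bangla directly."""
    model.eval()
    enc = tokenizer(bangla_texts, padding=True, truncation=True, return_tensors="pt")
    return model(**enc).logits.argmax(dim=-1)
```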

Future Directions: The BLUB benchmark could be extended with additional NLU tasks, such as dependency parsing, to provide a more comprehensive evaluation suite. Initializing Bangla Natural Language Generation (NLG) models from BanglaBERT is another promising direction for strengthening the Bangla NLP ecosystem.

This paper bridges an important gap in Bangla NLP resources and opens the door to language models that better reflect the linguistic nuances of low-resource languages like Bangla. The public release of the datasets and models encourages both academic and practical exploration in this domain, supporting community-driven advancement of Bangla language technologies.

Authors (8)
  1. Abhik Bhattacharjee (12 papers)
  2. Tahmid Hasan (10 papers)
  3. Wasi Uddin Ahmad (41 papers)
  4. Kazi Samin (3 papers)
  5. Md Saiful Islam (107 papers)
  6. Anindya Iqbal (24 papers)
  7. M. Sohel Rahman (52 papers)
  8. Rifat Shahriyar (25 papers)
Citations (154)