
Bangla Abstractive Summarization Dataset

Updated 1 December 2025
  • The paper introduces structured Bangla document–summary pairs to benchmark and train abstractive summarization models.
  • It details robust data collection, annotation, and preprocessing methods tailored to the Bangla language.
  • Evaluation metrics and baseline results demonstrate the advantage of transformer models over traditional LSTM approaches.

A Bangla abstractive summarization dataset is a structured corpus of Bengali-language document–summary pairs designed for benchmarking and training automatic summarization systems that generate human-like, non-extractive synopses rather than selecting source sentences verbatim. Such datasets underpin the recent shift toward supervised neural approaches for low-resource languages like Bangla. This entry enumerates the principal Bangla abstractive summarization corpora, highlights annotation and preprocessing protocols, details dataset statistics and structure, and discusses evaluation standards, technical challenges, and current research directions, with explicit references to peer-reviewed sources.

1. Major Bangla Abstractive Summarization Datasets

Several large-scale and moderately-sized corpora for Bangla abstractive summarization have been developed in recent years. These include news, religious, educational, and multi-domain datasets:

| Dataset | Pairs | Domain(s) | Public Access / License | Reference |
|---|---|---|---|---|
| BANSData | 19,096 | News (bdnews24.com) | CC BY-NC-SA 4.0, Kaggle | (Bhattacharjee et al., 2020; Dhar et al., 2021) |
| BANS-133 | 133,148 | News (bdnews24.com) | CC BY-NC-SA 4.0, GitHub | (Dhar et al., 2021) |
| MultiBanAbs | 54,620 | News (Samakal), business (TBS), blogs (Cinegolpo) | Permissive academic, Kaggle | (Ferdous et al., 24 Nov 2025) |
| BeliN | 2,520 | Religious news (14 sources) | As per repository license, GitHub | (Osama et al., 2 Jan 2025) |
| NCTB | 139 | Textbooks/educational (NCTB) | MIT License, GitHub | (Chowdhury et al., 2021) |

These corpora are constructed with native-script Bangla, varying in scale from <200 pairs (NCTB) to >100k pairs (BANS-133). All datasets comprise paired human-written abstractive summaries and articles, though annotation protocol and domain diversity differ substantially.

2. Data Collection, Annotation, and Preprocessing

Corpus construction protocols emphasize source diversity, editorial consistency, and linguistic normalization:

  • Source selection: News domains draw from professionally edited Bangladeshi portals (bdnews24.com, Samakal, The Business Standard, Prothom Alo, etc.), whereas NCTB covers curriculum expository texts and BeliN consolidates religious reporting from 14 outlets.
  • Summary provenance: For BANSData and BANS-133, summaries are publisher-provided headlines or ledes; no additional rewriting is performed. MultiBanAbs summaries are the abstracts provided by content creators at source. BeliN includes manual labeling for category, aspect, and sentiment for contextual supervision. In NCTB, summaries are written by professional summary writers and edited for curriculum relevance.
  • Preprocessing (a combined sketch follows this list):
    • Standard Unicode normalization (NFKC) and HTML cleaning.
    • Tokenization tailored to Bangla script (model-specific or BNLP toolkit procedures).
    • Filtering to remove non-Bangla, duplicate, and insufficiently long samples: e.g., BANSData excludes articles shorter than 50 tokens and summaries shorter than 5 tokens.
    • For BeliN, extra fields are concatenated for transformer-based input with explicit [SEP] tokens.
    • NCTB documents undergo POS tagging, punctuation filtering, and stopword removal.
  • Splitting: Most datasets adopt a 70/20/10 or 80/10/10 train/validation/test split, stratified to preserve domain or aspect balance (e.g., BeliN).
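
The sketch below illustrates the normalization, filtering, and contextual-fusion steps listed above. It is a minimal sketch, not any paper's released pipeline: tokenization here is naive whitespace splitting (the cited works use Bangla-specific tokenizers such as the BNLP toolkit), and the field ordering in the BeliN-style concatenation is an assumption for illustration.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Apply NFKC Unicode normalization plus crude HTML and whitespace cleaning."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)  # strip residual HTML tags
    return re.sub(r"\s+", " ", text).strip()

def keep_pair(article: str, summary: str) -> bool:
    """BANSData-style length filter: articles >= 50 tokens, summaries >= 5.
    Whitespace splitting stands in for a proper Bangla tokenizer."""
    return len(article.split()) >= 50 and len(summary.split()) >= 5

def belin_encoder_input(article: str, category: str, aspect: str, sentiment: str) -> str:
    """BeliN-style contextual fusion: concatenate extra fields with explicit
    [SEP] tokens before the article text (field order is illustrative)."""
    return f"{category} [SEP] {aspect} [SEP] {sentiment} [SEP] {article}"
```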

3. Dataset Structure and Format

All datasets provide structured file formats suitable for neural modeling pipelines:

  • JSONL or JSON is standard, with one record per sample.
  • Fields typically include unique ID, article text, summary text, and sometimes metadata (e.g., category, domain, aspect, sentiment).
  • BeliN fields: { Article, Headline, Category, Aspect, Sentiment }.
  • MultiBanAbs fields: { id, source, article_text, summary_text, domain_label }.
  • BANSData / BANS-133 fields: { id, article, summary, split }.
  • NCTB fields: { doc_id, document, summary }.
  • Pre-tokenization and truncation: for transformer inputs, articles are truncated to 512 tokens and summaries or headlines to 64–128 tokens for computational tractability (see the loading sketch below).
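
Since the datasets ship as line-delimited JSON, loading is a few lines. The sketch below assumes the MultiBanAbs field names listed above and a hypothetical file name; truncation is normally delegated to the model's tokenizer rather than done by hand.

```python
import json

def load_jsonl(path: str):
    """Yield one record (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Hypothetical file name; actual split files vary by release.
for record in load_jsonl("multibanabs_train.jsonl"):
    article, summary = record["article_text"], record["summary_text"]
    print(record["id"], record["domain_label"],
          len(article.split()), len(summary.split()))
    break
```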

4. Corpus Statistics and Abstractive Characteristics

The corpora demonstrate a wide range in document and summary lengths, abstraction levels, and domain coverage:

  • Length distributions:
    • BANSData: articles mean ≈200 tokens, summaries mean ≈15 tokens.
    • BANS-133: articles mean ≈230 tokens, summaries mean ≈18 tokens.
    • MultiBanAbs: articles min 63/max 1,052 words (mean ≈262), summaries min 1/max 63 words (mean ≈30).
    • BeliN: articles mean 1,001 words (32.75 sentences on average), headlines mean 17.13 words.
    • NCTB: documents mean 91.33 tokens, summaries mean 36.23 tokens, copy rate 27% (high abstraction).
  • Domain heterogeneity: MultiBanAbs is explicitly multi-domain (news, financial, blog), offering higher lexical and syntactic variability; BeliN is confined to the religious news domain.
  • Abstraction: Most Bengali corpora use editor-written summaries or headlines, favoring paraphrase, compression, and abstraction. NCTB is specifically noted for a high paraphrastic rewrite rate, supporting its use in modeling abstractive summarization (one way to compute the copy rate cited above is sketched after this list).
  • Vocabulary:
    • MultiBanAbs: 346,869 unique tokens.
    • BeliN: article vocab ≈9,750; headline vocab ≈1,410.
    • BANSData: most frequent 40,000 tokens used in modeling.
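
One plausible reading of the copy rate reported for NCTB is the fraction of summary tokens that also occur in the source document, with lower values indicating more abstraction; the exact definition in the original paper may differ. A minimal sketch:

```python
def copy_rate(document_tokens: list[str], summary_tokens: list[str]) -> float:
    """Fraction of summary tokens present in the source document.
    Lower values indicate a more abstractive summary."""
    source_vocab = set(document_tokens)
    copied = sum(1 for tok in summary_tokens if tok in source_vocab)
    return copied / max(len(summary_tokens), 1)

# A summary reusing 1 of its 4 tokens from the source has copy rate 0.25.
```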

5. Benchmarking, Evaluation Metrics, and Baseline Systems

All datasets are benchmarked using established n-gram-based and subsequence-based measures:

  • Standard metrics (minimal reference implementations are sketched at the end of this section):
    • ROUGE-N (n-gram overlap, typically unigram and bigram), with recall, precision, and F₁ variants per the standard definitions.
    • ROUGE-L (longest common subsequence), with precision, recall, and F₁.
    • BLEU (corpus-level, with brevity penalty).
  • Reported baseline performance:
| Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Reference |
|---|---|---|---|---|---|---|
| BANSData | Local-attention LSTM | 0.31 | 0.30 | 0.11 | — | (Bhattacharjee et al., 2020) |
| BANS-133 | Pointer-Generator | 0.67 | 0.42 | 0.49 | — | (Dhar et al., 2021) |
| MultiBanAbs | BanglaT5-small | 24.01 | 12.12 | 20.20 | 8.38 | (Ferdous et al., 24 Nov 2025) |
| BeliN | BanglaT5 + context | — | — | 24.19 | 18.61 | (Osama et al., 2 Jan 2025) |
| NCTB | BenSumm | 0.1217 | 0.0192 | 0.1135 | — | (Chowdhury et al., 2021) |
  • Models include attention-enhanced LSTMs, transformer encoder–decoders (BanglaT5, mT5), pointer-generator architectures, and analytic graph-based fusion (BenSumm).
  • BeliN achieves +15.7% BLEU and +4.8% ROUGE-L improvement from contextual feature fusion.
  • MultiBanAbs highlights the importance of Bangla-specific pretraining: BanglaT5-small substantially outperforms LSTM and mT5-small.
  • Qualitative and human judgment:
    • BANSData is evaluated by native speakers (average rating 2.80/5 on fluency/informativeness).
    • The NCTB dataset employs editorial review for accuracy and curriculum alignment.
  • Model evaluation follows the metrics' formal definitions as given in the literature and the respective papers' appendices.
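
For reference, below are minimal implementations of ROUGE-N and ROUGE-L F₁, written from the standard definitions rather than taken from any cited paper's evaluation scripts; published results typically use official packages with stemming and language-specific handling.

```python
from collections import Counter

def rouge_n(candidate: list[str], reference: list[str], n: int = 1) -> float:
    """ROUGE-N F1: n-gram overlap between a candidate and a reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l(candidate: list[str], reference: list[str]) -> float:
    """ROUGE-L F1 based on the longest common subsequence (LCS)."""
    m, k = len(candidate), len(reference)
    dp = [[0] * (k + 1) for _ in range(m + 1)]  # LCS dynamic program
    for i in range(m):
        for j in range(k):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][k]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / k
    return 2 * precision * recall / (precision + recall)

# Toy example with whitespace-tokenized Bangla strings.
cand = "সরকার নতুন বাজেট ঘোষণা করেছে".split()
ref = "সরকার আজ নতুন বাজেট ঘোষণা করল".split()
print(rouge_n(cand, ref, 1), rouge_n(cand, ref, 2), rouge_l(cand, ref))
```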

6. Applications, Challenges, and Future Directions

The development of Bangla abstractive summarization datasets addresses the scarcity of low-resource parallel text corpora, enabling:

  • Supervised sequence-to-sequence training for headline generation, news summarization, and educational content simplification.
  • Multi-task learning approaches leveraging provided labels (e.g., category, sentiment in BeliN).
  • Transfer learning and domain adaptation, with multi-domain datasets (MultiBanAbs) providing robustness across news, business, and informal styles.
  • Benchmarking new architectures: transformer variants and contextual fusion models are now tractable for Bangla due to advancements in dataset size and diversity.

Chief limitations include domain imbalance (e.g., MultiBanAbs has only 1.3% blog data), fixed input-length constraints (e.g., 8.3% of articles in MultiBanAbs are truncated), and variable annotation quality (e.g., reliance on headlines as gold-standard summaries). There is also a notable absence of large-scale human evaluations and of multi-reference test sets for robust assessment. Expansion plans include covering additional domains (scientific, social, conversational), enriching manual annotations (factuality, coherence), and leveraging retrieval-augmented and reinforcement learning strategies for further gains in abstraction and factual consistency (Ferdous et al., 24 Nov 2025; Osama et al., 2 Jan 2025).

7. Public Availability and Usage

All major Bangla abstractive summarization corpora are publicly released for non-commercial academic research, with repositories hosted on GitHub or Kaggle. Typical licenses are CC BY-NC-SA 4.0 for BANSData and BANS-133, a permissive academic license for MultiBanAbs, MIT for NCTB, and a custom repository license for BeliN. Scholars should refer to each dataset's README and license file for reuse compliance. These datasets define the current benchmarks for Bangla summarization research and support the development and evaluation of both generic and context-aware abstractive models (Ferdous et al., 24 Nov 2025; Dhar et al., 2021; Osama et al., 2 Jan 2025; Chowdhury et al., 2021; Bhattacharjee et al., 2020).
