BanglaFake Dataset: Bangla Misinformation & Deepfake Detection
- BanglaFake comprises curated text and deepfake audio corpora for detecting Bangla misinformation, with detailed metadata and documented class distributions.
- The dataset incorporates rigorous preprocessing and feature engineering techniques, including tokenization, TF–IDF and neural embeddings, to support both classical and deep learning model evaluations.
- Benchmarking studies reveal strong performance from methods like SVM, BERT variants, and advanced neural networks, highlighting its value for scalable fake news and deepfake detection research in low-resource settings.
BanglaFake Dataset
The BanglaFake dataset refers to a family of large-scale, curated corpora developed for the detection of misinformation, manipulation, and audio deepfakes in Bangla (Bengali) language contexts. The designation “BanglaFake” appears both as a generic term for early benchmark text datasets on Bangla fake news and as the explicit name of a released deepfake audio corpus. These datasets provide comprehensive, multi-modal resources with well-documented annotation protocols and detailed metadata, serving as the primary foundation for both classical and neural misinformation detection in low-resource Bangla.
1. Text-Based BanglaFake/BanFakeNews Datasets
1.1. Origins and Construction
The canonical BanglaFake text dataset originated with the BanFakeNews project, which compiled approximately 50,000 news items from 22 major Bangla news portals and several fact-checking platforms between 2018 and 2020 (Hossain et al., 2020). Authentic samples (≈48,700) were crawled from mainstream portals, while fake news instances (≈1,300), spanning misleading/contextual misinformation, clickbait, and satire/parody, were assembled from jaachai.com, bdfactcheck.com, and Bangla satire sites. Collected metadata includes the domain URL, publication time, article categories mapped into 12 high-level groups, headline–article relatedness, and the attributed source (person/organization). Annotation was performed by undergraduate computer science students, with fake-news items validated against fact-checking sites and duplicates removed. Labels comprise a binary real/fake distinction, with internal subtypes: False Context, Clickbait, and Satire/Parody.
Subsequent research reused and expanded this core resource. Notably, the BanFakeNews-2.0 dataset increased the size to 60,000 news items (47,000 authentic and 13,000 fake) across 13 news categories, with a more balanced distribution of fake-news items per topic and a thoroughly validated independent test set (Shibu et al., 16 Jan 2025).
1.2. Dataset Structure and Statistics
The principal text datasets demonstrate:
- Total size: N = 48,678 real + 1,299 fake = 49,977 (Hossain et al., 2020); BanFakeNews-2.0: 47,000 real + 13,000 fake (Shibu et al., 16 Jan 2025).
- Train/test/validation splits: canonical 70%/30%, with a further 10% held out as a validation set for early stopping; BanFakeNews-2.0 recommends 70%/15%/15%.
- Class imbalance: authentic items dominate (≈97% real in the early dataset); BanFakeNews-2.0 reduces this to ≈1:3 fake:real.
- News categories: typically 12 or 13, including National, International, Sports, Politics, Editorial, Miscellaneous, Education, Technology, Crime, Finance, Entertainment, Medical, and Religious.
- Per-article length statistics (mean character, word, and sentence counts) are reported separately for authentic and fake items; see Hossain et al. (2020) for the exact figures.
- Subset with explicit metadata: ≈8,500 items include full source and headline–body relation fields.
- Distribution across categories is skewed, with real items concentrated in National and International, and fake items mostly in Miscellaneous and specific topical areas (Hossain et al., 2020).
2. Preprocessing and Feature Engineering
Text preprocessing pipelines incorporate Unicode normalization, lowercasing, Bangla stopword removal (Stopwords-ISO), and optional punctuation/numeric normalization. Tokenization is applied at both word and character granularity.
Feature sets evaluated for classical and neural models include:
- Lexical: word n-grams and character n-grams with TF–IDF weighting
- Syntactic: POS-tag frequency distribution (using 10 Bangla POS classes)
- Semantic: Article-level mean and standard deviation of pre-trained word embeddings (FastText 300-dim, Word2Vec 100-dim), with corpus coverage at 54–55%
- Metadata: Punctuation frequency, normalized Alexa-rank of domain, headline/body length
- Neural pipeline experiments employ end-to-end trainable embeddings with maximum lengths (e.g., 250 tokens or 512 subwords for Transformers) (Hossain et al., 2020, Roy et al., 2024, Shibu et al., 16 Jan 2025).
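As a sketch of the lexical feature block, word- and character-level TF–IDF features can be built and concatenated with scikit-learn; the toy corpus and n-gram ranges below are illustrative assumptions, not the papers' exact settings:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["খবরটি সত্য", "খবরটি ভুয়া এবং বিভ্রান্তিকর"]  # toy two-article corpus

# Word- and character-level TF-IDF; n-gram ranges here are illustrative.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))

# Concatenate both sparse feature blocks, one row per article
X = hstack([word_vec.fit_transform(docs), char_vec.fit_transform(docs)])
print(X.shape[0])  # → 2
```

The resulting sparse matrix can be fed directly to the classical models (e.g., an SVM) evaluated in the benchmarks.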
3. Benchmarking and Model Performance
Textual BanglaFake datasets have been evaluated with baselines, shallow ML, and advanced DL models:
- SVM (char-3-grams or all features): best fake-class F1 = 0.89–0.91
- CNN: F1 up to 0.59; Bi-LSTM + attention: F1 up to 0.53
- Multilingual BERT (fine-tuned): F1 = 0.68; monolingual BanglaBERT, SagorBERT, and mBERT variants: macro-F1 up to 0.87 on BanFakeNews-2.0
- LLMs: QLoRA-adapted BLOOM 560M achieves macro-F1 = 0.89 on BanFakeNews-2.0 (Shibu et al., 16 Jan 2025)
- Human annotators achieve average fake-class F1 ≈ 0.65, consistently outperformed by state-of-the-art automated systems (Hossain et al., 2020).
Text-based benchmarks employ precision, recall, F1, macro-F1, and area under the ROC curve as standard metrics, focusing particularly on fake-class (positive) detection.
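These metrics can be computed per class with scikit-learn; the labels below are a toy example (1 = fake, the positive class):

```python
from sklearn.metrics import precision_recall_fscore_support, f1_score

y_true = [1, 0, 0, 1, 0, 0, 0, 1]   # toy ground truth; 1 = fake
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]   # toy predictions

# Fake-class (positive) precision/recall/F1, as reported in the benchmarks
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1], average=None)
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(f1[0], 2), round(macro_f1, 2))  # → 0.67 0.73
```

Reporting the fake-class F1 separately matters because, under heavy class imbalance, overall accuracy is dominated by the real class.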
4. Deepfake Audio: BanglaFake Speech Corpus
The BanglaFake name is explicitly attached to two major Bengali deepfake audio dataset releases (Fahad et al., 16 May 2025; Samu et al., 25 Dec 2025). These corpora comprise:
- 12,260 real + 13,260 deepfake utterances (total 25,520 clips), each 6–7 seconds; mean duration ≈6.5s
- Real speech sourced from the SUST TTS corpus and Mozilla Common Voice (7 speakers, both genders, multi-dialect Bangla)
- Synthetic/deepfake speech generated using VITS-based TTS models (conditional VAE, flow-based latent prior, HiFi-GAN decoder), trained on SUST TTS
- All audio is WAV, 22,050 Hz, 16-bit PCM; transcripts in LJ-Speech CSV convention
- Slight class imbalance (fake:real ≈1.08), no speaker-level imbalance
- Accessible via HuggingFace and GitHub (permissive research licenses)
Audio evaluation integrates:
- MOS (mean opinion score, human ratings): naturalness = 3.40, intelligibility = 4.01 (five-point scale)
- Feature analysis via MFCC extraction and t-SNE visualization: real and fake utterances partially overlap in acoustic space, highlighting detection challenge (Fahad et al., 16 May 2025)
- Benchmarking with CNN/RNN/transformer architectures (Wav2Vec2-XLSR-53, ResNet18, LCNN, ViT-B16, CNN-BiLSTM). Fine-tuned ResNet18 achieves accuracy = 79.17%, F1 = 79.12%, AUC = 84.37%, EER = 24.35%. Zero-shot transformer performance is much lower (accuracy <54%) (Samu et al., 25 Dec 2025).
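The MFCC + t-SNE analysis can be sketched as follows. The random features below are stand-ins for per-utterance MFCC means; in practice these would be extracted from the 22,050 Hz WAV files with an audio library such as librosa, which is assumed and not shown here:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for per-utterance mean MFCC vectors (13 coefficients each).
# The offset between the two classes is illustrative: real and fake
# clusters partially overlap, mirroring the reported acoustic overlap.
real_feats = rng.normal(0.0, 1.0, size=(50, 13))
fake_feats = rng.normal(0.3, 1.0, size=(50, 13))
X = np.vstack([real_feats, fake_feats])

# Project to 2-D for visual inspection of class separation
emb = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)
print(emb.shape)  # → (100, 2)
```

Plotting `emb` colored by class would reproduce the kind of overlap visualization described in Fahad et al. (16 May 2025).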
5. Specialized Derivatives and Related Resources
Several variants for related NLP tasks and modalities have been derived from or linked to the core BanglaFake datasets:
- BanMANI corpus (Kamruzzaman et al., 2023): 800 social media items (Facebook posts/comments), labeled as “manipulated” vs. “non-manipulated” relative to 500 BanFakeNews reference articles, with gold spans identifying altered excerpts. Used for text-based manipulation and claim detection benchmarks; employs zero-shot/fine-tuned GPT models (fine-tuned GPT-3 ada achieves F1 = 65.77% on detection)
- Bengali Fake Review Detection (“BFRD”): food review–specific dataset from social media (1,339 fake, 7,710 authentic), with heavy code-mixed/romanized text and extensive text normalization. Weighted-ensemble transformer systems on augmented sets achieve F1 >0.98 (Shahariar et al., 2023)
6. Applications, Limitations, and Community Guidelines
The BanglaFake text and audio datasets underpin research on:
- Automated fake news filtering for Bangla web and social media applications
- Human-in-the-loop fact-checking and misinformation discovery workflows
- Benchmarking of LLMs, cross-lingual transfer, and data augmentation strategies
- Evaluation of audio deepfake detection, adversarial speech synthesis, and anti-spoofing
Reported limitations include strong inherent class imbalance (especially in text, with early real/fake ratios near 39:1), source-domain and topical skew, limited annotation coverage for metadata fields, absence of user-demographic and propagation information, and loss of long-range context (some versions restrict context to headlines/body ≤30 tokens) (Hossain et al., 2020, George et al., 24 Feb 2025).
Researchers are advised to:
- Use stratified or oversampled splits to address class imbalance
- Disclose all metrics (including precision/recall by class and macro-F1)
- Share derived versions and harmonized splits for comparability
- Engage in domain expansion and annotation refinement (e.g., finer-grained manipulation and claim types), and adopt multi-modal extensions as needed (Shibu et al., 16 Jan 2025, Kamruzzaman et al., 2023)
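The stratified-split recommendation can be sketched with scikit-learn; the toy 90:10 corpus below is illustrative, and the 70/15/15 proportions follow the BanFakeNews-2.0 recommendation:

```python
from sklearn.model_selection import train_test_split

# Toy imbalanced corpus: 90 real (label 0) and 10 fake (label 1) articles
texts = [f"article {i}" for i in range(100)]
labels = [0] * 90 + [1] * 10

# Stratified 70/15/15 split, preserving the fake:real ratio per partition
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

print(len(X_tr), len(X_val), len(X_te))  # → 70 15 15
```

Stratification ensures each partition contains fake items at the corpus ratio, so fake-class precision/recall remain measurable on the held-out sets.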
BanglaFake establishes a foundation for scalable, reproducible misinformation detection in the Bangla language, supporting continual advances in text, audio, and multi-modal fake-content detection in low-resource settings.