Bengali Deepfake Detection
- Bengali deepfake detection is a field focused on identifying AI-generated audio and text artifacts in the Bengali language using deep learning and feature engineering.
- The methodology leverages specialized datasets such as BanglaFake, employing acoustic features like MFCCs and advanced architectures including ResNet18 to detect synthetic speech.
- Transformer models like XLM-RoBERTa and mDeBERTaV3 are fine-tuned to discern AI-generated Bengali text, achieving accuracies above 90% in controlled benchmarks.
Bengali deepfake detection encompasses the identification of AI-generated audio and text artifacts in the Bengali language, targeting both synthetic speech and paraphrased text produced by large generative models. The field is advancing rapidly, driven by the proliferation of high-quality synthetic content enabled by recent developments in speech and language synthesis. Systematic benchmarks now exist for both audio and text modalities, leveraging specialized Bengali datasets and advanced deep learning architectures to address the challenges posed by low-resource linguistic environments (Fahad et al., 16 May 2025, Islam et al., 25 Dec 2025, Samu et al., 25 Dec 2025).
1. Bengali Deepfake Audio: Dataset Foundation
The BanglaFake dataset is the central resource for Bengali audio deepfake detection, comprising 25,520 utterances (12,260 genuine, 13,260 synthetic). Genuine utterances are sourced from the SUST TTS Corpus (phonetically balanced, studio recordings) and Mozilla Common Voice (browser-collected, dialect-diverse). Seven speakers (4 male, 3 female; ages 18–30) provide coverage for real audio. Deepfakes are synthesized using a VITS-based text-to-speech pipeline (Conditional VAE, adversarial training, HiFi-GAN decoder, Flow++ normalizing flows), trained from scratch on the SUST corpus at 22.05 kHz (Fahad et al., 16 May 2025, Samu et al., 25 Dec 2025).
The data are approximately balanced between real and synthetic utterances, each truncated or zero-padded to 5 s for ML experimentation. No official train/val/test splits are provided for BanglaFake, so studies either recommend cross-validation or adopt their own partitions; a typical benchmarking split is 70 % train, 15 % validation, 15 % test (Samu et al., 25 Dec 2025).
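As a concrete illustration of this preprocessing convention, the sketch below fixes every waveform to 5 s and draws a 70/15/15 split; the file handling, random seed, and function names are illustrative assumptions rather than the cited papers' code.

```python
import numpy as np
import librosa

SAMPLE_RATE = 22050            # BanglaFake audio is produced at 22.05 kHz
CLIP_SECONDS = 5               # each utterance is truncated or zero-padded to 5 s
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS

def load_fixed_length(path: str) -> np.ndarray:
    """Load a waveform and force it to exactly 5 s (truncate or zero-pad)."""
    wav, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    if len(wav) >= CLIP_SAMPLES:
        return wav[:CLIP_SAMPLES]
    return np.pad(wav, (0, CLIP_SAMPLES - len(wav)))

def split_70_15_15(paths, seed: int = 42):
    """Shuffle file paths and return (train, val, test) lists in a 70/15/15 ratio."""
    rng = np.random.default_rng(seed)
    paths = list(paths)
    rng.shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```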
2. Acoustic Feature Engineering and Dataset Assessment
Acoustic differentiation in Bengali deepfake detection relies on Mel Frequency Cepstral Coefficients (MFCCs; 13 coefficients, window length 25 ms, hop 10 ms), preceded by standard preprocessing (pre‐emphasis, Hamming window, silence removal via energy VAD). t-SNE visualization of 1,000 MFCC pairs demonstrates heavy cluster overlap between real and fake samples, with synthetic data distributed in local sub-clusters (“islands”) that reflect decoder artifacts from the VITS pipeline.
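A minimal sketch of this front end with librosa and scikit-learn, using the parameters stated above (13 MFCCs, 25 ms window, 10 ms hop); the pre-emphasis coefficient and silence-removal threshold are assumptions.

```python
import numpy as np
import librosa
from sklearn.manifold import TSNE

SR = 22050

def mfcc_features(wav: np.ndarray, sr: int = SR) -> np.ndarray:
    """13 MFCCs with a 25 ms window and 10 ms hop, averaged over time."""
    wav = librosa.effects.preemphasis(wav, coef=0.97)      # pre-emphasis (coefficient assumed)
    intervals = librosa.effects.split(wav, top_db=30)      # energy-based silence removal (threshold assumed)
    wav = np.concatenate([wav[s:e] for s, e in intervals]) if len(intervals) else wav
    mfcc = librosa.feature.mfcc(
        y=wav, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        window="hamming",
    )
    return mfcc.mean(axis=1)                               # one 13-dim vector per utterance

def tsne_embed(feature_matrix: np.ndarray) -> np.ndarray:
    """Project utterance-level MFCC vectors to 2-D for real-vs-fake visualization."""
    return TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feature_matrix)
```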
Quantitative perceptual evaluation employs the Mean Opinion Score (MOS), gathering ratings from 30 native speakers (ages 20–25) on 5-point scales for naturalness and intelligibility. An outlier-robust aggregation of these ratings yields Robust-MOS (naturalness) = 3.40 and Robust-MOS (intelligibility) = 4.01, indicating that the deepfake audio is rated as highly natural and intelligible (Fahad et al., 16 May 2025).
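The papers' exact aggregation formula is not reproduced here; one common outlier-robust convention is a trimmed mean that discards ratings lying far from the raw mean, sketched below as an assumption rather than the formula used by Fahad et al.:

```latex
% Trimmed-mean MOS over the retained rating set R: ratings more than k standard
% deviations from the raw mean \bar{r} are discarded before averaging.
\[
\mathrm{Robust\text{-}MOS} \;=\; \frac{1}{|\mathcal{R}|} \sum_{r_i \in \mathcal{R}} r_i,
\qquad
\mathcal{R} \;=\; \{\, r_i \;:\; |r_i - \bar{r}| \le k\,\sigma_r \,\}
\]
```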
3. Baselines and Transfer Learning for Audio Detection
Zero-shot deepfake audio detection with large pretrained models is largely ineffective for Bengali. Wav2Vec2-XLSR-53 (multilingual SSL, 56k h) achieves only 53.80 % accuracy, 56.60 % AUC, and 46.20 % EER. Other models evaluated include Whisper, PANNs CNN14, WavLM, and AST, all yielding near-chance results (Samu et al., 25 Dec 2025).
Fine-tuning significantly improves results. Benchmarked architectures include:
- Wav2Vec2-Base (SSL encoder, English pretraining)
- LCNN (MFM blocks for spectrograms)
- LCNN-Attention (self-attention over temporal frames)
- ResNet18 (spectrogram images as input, leveraging pretrained ImageNet weights)
- Vision Transformer (ViT-B16, patch-based spectrogram embedding)
- CNN-BiLSTM (hybrid for sequential reasoning)
ResNet18 attains the highest accuracy and F1 (79.17 % and 79.12 %, with AUC 84.37 % and EER 24.35 %), while ViT-B16 and LCNN-Attention yield slightly higher AUC and lower EER. All models use binary cross-entropy loss, Adam/AdamW optimizers, short training schedules (2–14 epochs), and dropout (where applicable) for regularization. ResNet18 benefits from residual connections and pretrained image-domain representations, facilitating gradient flow and rapid convergence (Samu et al., 25 Dec 2025).
| Model | Acc | F1 | AUC | EER |
|---|---|---|---|---|
| Wav2Vec2-XLSR-53 (zero-shot) | 53.8% | 52.9% | 56.6% | 46.2% |
| ResNet18 (fine-tuned) | 79.17% | 79.12% | 84.37% | 24.35% |
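Complementing the benchmark above, a minimal PyTorch sketch of a ResNet18 detector operating on log-mel spectrograms treated as single-channel images; the learning rate and input shape are assumptions, not the benchmarked settings.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

def build_spectrogram_resnet18() -> nn.Module:
    """ResNet18 with ImageNet weights, adapted to single-channel spectrogram input
    and a single logit for real-vs-fake classification."""
    model = resnet18(weights=ResNet18_Weights.DEFAULT)
    # Replace the RGB stem with a 1-channel conv for (1, n_mels, time) inputs.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model.fc = nn.Linear(model.fc.in_features, 1)
    return model

model = build_spectrogram_resnet18()
criterion = nn.BCEWithLogitsLoss()                          # binary cross-entropy, as in the benchmark
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # learning rate is an assumption

def train_step(batch_specs: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of (B, 1, n_mels, T) log-mel spectrograms; labels in {0, 1}."""
    model.train()
    optimizer.zero_grad()
    logits = model(batch_specs).squeeze(1)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```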
4. Linguistic Deepfakes: Bengali AI-Generated Text
Bengali text deepfakes—AI-generated paraphrases—pose unique detection challenges. A balanced dataset, BanglaTextDistinguish (6,640 sentences), pairs human-authored samples (from news, textbooks, social media) with GPT-3.5 paraphrase outputs.
Transformer-based architectures evaluated include XLM-RoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base, and MultilingualBERT-Base. Zero-shot detection protocols (XNLI pipeline, mean-pooled cosine similarity) yield near-chance performance (≈50 % accuracy), so task-specific fine-tuning is essential. After training, XLM-RoBERTa-Large and mDeBERTaV3-Base reach ≈91.4–91.5 % accuracy (F1 ≈ 91.1 %), MultilingualBERT-Base follows closely (≈90.8 %), BanglaBERT-Base lags slightly (≈88 %), and IndicBERT-Base underperforms (74.25 %).
| Model | Accuracy (FT) | F1 (FT) | Accuracy (ZS) | F1 (ZS) |
|---|---|---|---|---|
| XLM-RoBERTa-Large | 91.50% | 91.07% | 49.0% | 30.1% |
| mDeBERTaV3-Base | 91.35% | 91.06% | 49.2% | 11.1% |
| BanglaBERT-Base | 88.26% | 87.87% | 50.0% | 66.6% |
| IndicBERT-Base | 74.25% | 72.06% | 50.3% | 65.0% |
| MultilingualBERT-Base | 90.82% | 90.83% | 50.3% | 66.7% |
Error analysis reveals that XLM-RoBERTa-Large rarely produces false positives but misses ≈13 % of AI paraphrases, whereas IndicBERT-Base suffers pronounced confusion in both directions (Islam et al., 25 Dec 2025).
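For the fine-tuning protocol described above, a minimal Hugging Face sketch with XLM-RoBERTa-Large as the backbone; the dataset loading, maximum sequence length, and training schedule are illustrative assumptions.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-large"   # any of the benchmarked backbones can be substituted

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    """Map raw Bengali sentences to fixed-length token IDs (max length assumed)."""
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# `train_ds` / `val_ds` are assumed to be `datasets.Dataset` objects with
# "text" and "label" columns (0 = human-authored, 1 = AI paraphrase):
# train_ds = train_ds.map(tokenize, batched=True)
# val_ds = val_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bangla-ai-text-detector",
    num_train_epochs=3,                   # assumption; the papers report their own schedules
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```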
5. Feature Set Optimization and Detection Strategies
Effective Bengali audio deepfake detection benefits from expanded feature sets (a feature-extraction sketch follows this list):
- High-order MFCC delta and acceleration coefficients leverage spectral distortions from neural vocoders.
- Prosodic metrics (pitch, energy) illuminate temporal inconsistencies from duration modeling artifacts.
- Complementary phase-based descriptors (e.g., CQCCs) address amplitude-phase decoupling missed by MFCCs alone.
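A sketch of such an expanded front end using librosa, combining MFCC deltas and accelerations with simple pitch/energy prosodic statistics; CQCC extraction is omitted, and the pitch-range bounds and pooling choices are assumptions.

```python
import numpy as np
import librosa

def expanded_features(wav: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Concatenate MFCC + delta + delta-delta statistics with pitch/energy prosody."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    d1 = librosa.feature.delta(mfcc)             # first-order (velocity) coefficients
    d2 = librosa.feature.delta(mfcc, order=2)    # second-order (acceleration) coefficients

    f0, _, _ = librosa.pyin(wav, sr=sr,
                            fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"))   # frame-level pitch track
    energy = librosa.feature.rms(y=wav)[0]                    # frame-level energy

    spectral = np.concatenate([m.mean(axis=1) for m in (mfcc, d1, d2)])
    prosodic = np.array([np.nanmean(f0), np.nanstd(f0), energy.mean(), energy.std()])
    return np.concatenate([spectral, prosodic])
```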
Advances in model architecture recommend multi-stream CNN–RNN hybrids, fusing spectral and prosodic features. Transfer learning from large multilingual deepfake corpora (ASVspoof, ADD) enables adaptation to Bengali. Self-supervised featurization (wav2vec 2.0) is advised prior to supervised fine-tuning, with projective contrastive learning to isolate cross-linguistic deepfake cues (Fahad et al., 16 May 2025).
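One way to realize the self-supervised featurization step is to pool frozen wav2vec 2.0 (XLSR) hidden states into utterance-level embeddings for a downstream classifier, as sketched below; the checkpoint name and mean pooling are assumptions.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-large-xlsr-53"   # multilingual SSL encoder (assumed checkpoint)

extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
encoder = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

@torch.no_grad()
def utterance_embedding(wav, sr: int = 16000) -> torch.Tensor:
    """Mean-pool frozen wav2vec 2.0 hidden states into one utterance-level vector.
    Audio must be resampled to 16 kHz before this call."""
    inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state     # (1, frames, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)             # (hidden_dim,) embedding for a classifier
```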
Text detection leverages transformer backbones with careful fine-tuning on balanced, domain-diverse paraphrase datasets. Integrating linguistic features alongside neural embeddings and adopting parameter-efficient adaptation approaches (e.g., LoRA) enhances deployment scalability (Islam et al., 25 Dec 2025).
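A parameter-efficient adaptation sketch using the peft library's LoRA wrapper around a sequence-classification backbone; the rank, scaling, and target-module choices are assumptions.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-large", num_labels=2)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,          # keeps the classification head trainable
    r=8,                                 # low-rank update dimension (assumption)
    lora_alpha=16,                       # scaling factor (assumption)
    lora_dropout=0.1,
    target_modules=["query", "value"],   # attach adapters to self-attention projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()       # typically only a small fraction of all weights
```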
6. Limitations and Prospects
Current Bengali deepfake detection is bounded by:
- Dataset scope: BanglaFake deepfakes stem from a single TTS method (VITS) and a limited speaker set; BanglaTextDistinguish covers only GPT-3.5-style paraphrasing.
- Attack diversity: the absence of adversarial and cross-dataset evaluation leaves generalization to novel deepfake generators largely untested.
- Modal coverage: Existing protocols do not yet fuse multimodal signals (lip movement, text prosody).
Future directions outlined by researchers include:
- Expanding datasets to include multiple TTS/VC systems and diverse real-world recording conditions.
- Adversarial and continual learning, maintaining detection performance under evolving spoofing strategies (e.g., RegO methods).
- Cross-lingual transfer from high-resource benchmarks and development of lightweight, quantized inference systems for resource-constrained deployments.
- Hybrid approaches combining transformer-derived embeddings with handcrafted linguistic cues for text, and integration of multimodal features for audio (Samu et al., 25 Dec 2025, Islam et al., 25 Dec 2025, Fahad et al., 16 May 2025).
A plausible implication is that despite robust performance (≈91 % accuracy for text, ≈79 % for audio), moderate EER and potential domain shifts necessitate periodic model refreshment, increased diversity in training data, and careful calibration for both high-precision and low-latency applications. These studies open pathways to higher-fidelity, cross-domain Bengali deepfake detection and informed content moderation.