
AI-Generated Text Detection in Bengali

Updated 1 January 2026
  • The paper demonstrates that fine-tuned transformer models achieve state-of-the-art detection accuracy (~91%) for distinguishing AI-generated from human-written Bengali text.
  • AI-generated text detection in Bengali is a binary classification task that leverages curated datasets and transformer architectures to overcome complex linguistic morphology.
  • The study finds that high-capacity models like XLM-RoBERTa and mDeBERTa outperform parameter-shared models such as IndicBERT in capturing Bengali-specific syntactic nuances.

AI-generated text detection in Bengali addresses the technical and practical challenge of distinguishing human-written content from that produced by LLMs such as GPT-3.5. The Bengali language poses unique detection challenges due to its complex morphology, rich inflection, and highly variable syntax. Recent research has focused on empirical evaluation of transformer architectures under both zero-shot and fine-tuning regimes, illuminating modeling limitations as well as successful detection strategies in the Bengali context (Islam et al., 25 Dec 2025).

1. Problem Formulation and Dataset Construction

AI-generated text detection is commonly formalized as a supervised binary classification problem: given a Bengali sentence $x$, predict a label $y \in \{0, 1\}$, where 0 denotes human-authored and 1 denotes AI-paraphrased text. The BanglaTextDistinguish dataset serves as the largest and most curated resource for this task in Bengali. It consists of 6,640 instances (mean sentence length ≈ 43.5 words, standard deviation 23.2), balanced between 3,320 human-authored and 3,320 GPT-3.5-paraphrased samples. Source domains span newspapers, textbooks, and social media, ensuring both formal and informal textual diversity. Each human sentence was paraphrased using GPT-3.5, and extensive deduplication was performed to ensure sample quality (Islam et al., 25 Dec 2025).
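The paper does not release its collection scripts or prompts, but the construction recipe above maps onto a short pipeline. The following Python sketch is only illustrative: it assumes the current openai client, a hypothetical paraphrasing prompt, and exact-match deduplication, none of which are specified in the paper.

```python
# Illustrative sketch of the dataset-construction recipe described above:
# paraphrase each human-written Bengali sentence with GPT-3.5, then deduplicate.
# The prompt and filtering rules here are assumptions, not the paper's.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def paraphrase_bengali(sentence: str) -> str:
    """Ask GPT-3.5 to paraphrase a Bengali sentence (hypothetical prompt)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Paraphrase the following Bengali sentence in Bengali:\n{sentence}",
        }],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

def build_pairs(human_sentences: list[str]) -> list[dict]:
    """Produce balanced (text, label) rows: 0 = human-authored, 1 = AI-paraphrased."""
    rows, seen = [], set()
    for sent in human_sentences:
        para = paraphrase_bengali(sent)
        # Exact-match deduplication; the paper's deduplication criteria are not specified.
        for text, label in ((sent, 0), (para, 1)):
            if text not in seen:
                seen.add(text)
                rows.append({"text": text, "label": label})
    return rows
```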

2. Transformer Architectures and Zero-Shot Regimes

The comparative study evaluates five pretrained transformer encoders, covering both large-scale multilingual and Bengali-focused models:

  • XLM-RoBERTa-Large: 550M parameters, 100-language pretraining, high capacity
  • mDeBERTaV3-Base: 140M parameters, ELECTRA-style pretraining with disentangled attention
  • BanglaBERT-Base: 110M parameters, Bengali monolingual pretraining
  • IndicBERT-Base: 110M parameters, multilingual parameter-sharing across 12 Indian languages
  • MultilingualBERT-Base (mBERT): 110M parameters, 104-language vanilla BERT

Zero-shot evaluation leverages pretrained checkpoints: XLM-RoBERTa and mDeBERTaV3 use XNLI-trained heads with Bengali class labels as natural language hypotheses (“মানব-লিখিত” vs. “AI-উত্পন্ন”, i.e., “human-written” vs. “AI-generated”); the other models rely on embedding-based similarity, in which pooled sentence and class-description embeddings are compared by cosine proximity to assign labels. All models in zero-shot scenarios operate at near-random accuracy (≈ 49%–50%), with some exhibiting pathological behavior (maximal recall, minimal precision), demonstrating that pretrained features are insufficient without Bengali task supervision (Islam et al., 25 Dec 2025).
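As a concrete illustration of the XNLI-style zero-shot setup, the sketch below uses the Hugging Face zero-shot-classification pipeline with the Bengali class labels as hypotheses. The specific checkpoint and hypothesis template are assumptions rather than details reported in the study.

```python
# Zero-shot detection via an XNLI-style entailment head, as described above.
# The checkpoint and hypothesis template are illustrative assumptions.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",  # assumed XNLI-fine-tuned XLM-R checkpoint
)

candidate_labels = ["মানব-লিখিত", "AI-উত্পন্ন"]  # "human-written" vs. "AI-generated"

result = classifier(
    "এটি একটি উদাহরণ বাংলা বাক্য।",          # "This is an example Bengali sentence."
    candidate_labels=candidate_labels,
    hypothesis_template="এই লেখাটি {}.",      # assumed Bengali hypothesis template
)
print(result["labels"][0], result["scores"][0])  # top predicted label and its score
```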

| Model | Parameters | Pretraining Corpus/Method |
|---|---|---|
| XLM-RoBERTa-Large | 550M | 100 languages (masked LM) |
| mDeBERTaV3-Base | 140M | ELECTRA, cross-lingual |
| BanglaBERT-Base | 110M | Bengali monolingual |
| IndicBERT-Base | 110M | 12 Indian languages |
| MultilingualBERT-Base | 110M | 104 languages |

3. Fine-Tuning Methodology

Supervised fine-tuning yields a drastic performance increase over zero-shot classification. Data splits are strictly stratified 60 : 20 : 20 for training, validation, and testing. Tokenization is model-specific, with maximum sequence lengths ranging from 128 to 256 tokens. Training employs binary cross-entropy loss (as cross-entropy over two logits), the AdamW optimizer (weight decay = 0.01), and model/batch-specific learning rates. Early stopping is applied with a patience of 2–3 epochs, and mixed precision (fp16) is used where supported. Regularization includes weight decay and optional gradient accumulation; no explicit data augmentation is performed apart from the initial GPT-3.5 paraphrasing. Hyperparameter details are as follows, with a minimal fine-tuning sketch after the table (Islam et al., 25 Dec 2025):

| Model | LR | Batch | Epochs | Seq-Len |
|---|---|---|---|---|
| XLM-RoBERTa-Large | $1 \times 10^{-5}$ | 8 | 5 | 256 |
| mDeBERTaV3-Base | $2 \times 10^{-5}$ | 16 | 4 | 256 |
| BanglaBERT-Base | $1 \times 10^{-5}$ | 16 | 3 | 128 |
| IndicBERT-Base | $2 \times 10^{-5}$ | 16 | 9 | 256 |
| MultilingualBERT-Base | $2 \times 10^{-5}$ | 16 | 2 | 256 |
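A minimal fine-tuning sketch consistent with this recipe (AdamW with 0.01 weight decay, early stopping, fp16, per-model hyperparameters from the table) is shown below, using the Hugging Face Trainer with the mDeBERTaV3-Base settings. The CSV file names and "text"/"label" column layout are assumptions, and argument names follow a recent transformers release.

```python
# Minimal fine-tuning sketch following the recipe above, using the
# mDeBERTaV3-Base row of the hyperparameter table. File names and the
# "text"/"label" column layout are assumptions, not the paper's artifacts.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "microsoft/mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumed CSV files with "text" and "label" columns from a 60:20:20 stratified split.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def tokenize(batch):
    # Truncate to the per-model maximum sequence length (256 for mDeBERTaV3-Base).
    return tokenizer(batch["text"], truncation=True, max_length=256)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bangla-ai-text-detector",
    learning_rate=2e-5,                 # per-model LR from the table
    per_device_train_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,                  # AdamW weight decay, as reported
    fp16=True,                          # mixed precision where supported
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,                # enables dynamic padding via the default collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```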

4. Results and Comparative Performance

Zero-shot transformer models all yield accuracy and F1-score near 50%, confirming ineffective detection without fine-tuning. Fine-tuning with ∼4,000 labeled instances produces substantial gains, with the best models reaching ~91% in both accuracy and F1:

| Model | Accuracy | Precision | Recall | F₁ |
|---|---|---|---|---|
| XLM-RoBERTa-Large | 91.5% | 95.8% | 86.8% | 91.1% |
| mDeBERTaV3-Base | 91.4% | 94.1% | 88.3% | 91.1% |
| MultilingualBERT-Base | 90.8% | 90.7% | 91.0% | 90.8% |
| BanglaBERT-Base | 88.3% | 90.8% | 85.1% | 87.9% |
| IndicBERT-Base | 74.3% | 78.8% | 66.4% | 72.1% |

Fine-tuned large-scale transformers (XLM-RoBERTa, mDeBERTa, mBERT) cluster tightly at ≈ 91% on both principal metrics. BanglaBERT achieves slightly lower performance. IndicBERT clearly underperforms (≈ 74%), which is attributed to parameter sharing across 12 languages, limiting its capacity to capture Bengali-specific morphological and inflectional cues. Confusion matrices reveal that XLM-RoBERTa and mDeBERTa exhibit minimal false positives and false negatives, while IndicBERT’s errors remain elevated (Islam et al., 25 Dec 2025).
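The reported error analysis can be reproduced with a standard confusion matrix over the held-out test split; the short sketch below assumes a trained Trainer and a tokenized test set as in the fine-tuning sketch above, with both names being illustrative.

```python
# Confusion-matrix error analysis on the held-out test split.
# Assumes `trainer` from the fine-tuning sketch and a tokenized `test_set`
# (a datasets.Dataset with a "label" column); both names are illustrative.
import numpy as np
from sklearn.metrics import confusion_matrix

output = trainer.predict(test_set)
y_pred = np.argmax(output.predictions, axis=-1)
y_true = output.label_ids

# Rows are true classes (0 = human-authored, 1 = AI-paraphrased); columns are predictions.
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
```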

5. Linguistic and Methodological Challenges

Detection in Bengali is obstructed by the language’s complex morphological structure, including case markers, compound verbs, and infixes. These properties can obscure statistical cues or token patterns typically exploited by AI-generated text detectors. Furthermore, Bengali’s flexible word order and formal/informal register variation increase the heterogeneity of human writing, complicating the identification of non-human paraphrasing signatures. The results demonstrate that high-capacity, multilingual transformer architectures fine-tuned on modestly sized, well-curated Bengali data can internalize sufficiently discriminative features, but that generic, parameter-shared models (IndicBERT) struggle with the language’s morphological idiosyncrasies (Islam et al., 25 Dec 2025).

6. Evaluation Metrics and Calibration Considerations

The standard metrics for binary classification are reported: accuracy, precision, recall, and F₁ (including F₁-macro). Calibration is assessed using the Brier score, with AUROC also computed to measure discrimination. Monitoring calibration is critical before deployment: XLM-RoBERTa exhibits slight overconfidence at mid-range probabilities—a property that may affect threshold-based classifiers in production contexts (Islam et al., 25 Dec 2025).

The principal metric formulas:

  • $\mathtt{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
  • $\mathtt{Precision} = \frac{TP}{TP + FP}$
  • $\mathtt{Recall} = \frac{TP}{TP + FN}$
  • $F_1 = \frac{2 \cdot \mathtt{Precision} \cdot \mathtt{Recall}}{\mathtt{Precision} + \mathtt{Recall}}$

where $TP$ = true positives, $TN$ = true negatives, $FP$ = false positives, and $FN$ = false negatives.
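These formulas, together with the Brier score and AUROC used for calibration and discrimination, map directly onto scikit-learn. In the sketch below, y_true holds gold labels, y_pred hard predictions, and p_ai the model’s predicted probability for the AI-paraphrased class; the function name is illustrative.

```python
# Reported metrics plus calibration (Brier score) and discrimination (AUROC).
# y_true: gold labels, y_pred: hard predictions, p_ai: predicted P(label = 1).
from sklearn.metrics import (accuracy_score, brier_score_loss, f1_score,
                             precision_score, recall_score, roc_auc_score)

def detection_report(y_true, y_pred, p_ai):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "brier": brier_score_loss(y_true, p_ai),  # lower is better calibrated
        "auroc": roc_auc_score(y_true, p_ai),
    }
```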

7. Conclusions, Recommendations, and Future Research

Zero-shot transformer approaches in Bengali are ineffective for AI-text detection. Fine-tuning on even moderate-scale datasets with high-capacity multilingual or monolingual transformer backbones yields state-of-the-art results (~91% accuracy/F1). A small, carefully balanced Bengali dataset suffices for production-quality detection when paired with appropriate models. IndicBERT’s relative weakness underscores the need for parameter capacity dedicated to Bengali-specific structure.

Recommended future directions include:

  • Expansion of datasets across more domains (e.g., literary, technical, chat) and modern LLMs (e.g., GPT-4, LLaMA2).
  • Model ensembling or knowledge distillation for lightweight, on-device detectors.
  • Strengthening adversarial robustness with crafted paraphrase-style attacks.
  • Exploring joint, cross-lingual training with morphologically similar low-resource languages to enhance generalization.

This research establishes a robust methodological foundation for AI-generated text detection in Bengali and underscores the critical role of transformer fine-tuning and language-specific modeling strategies in low-resource detection contexts (Islam et al., 25 Dec 2025).
