
BLP 2025 Shared Task Overview

Updated 9 December 2025
  • BLP 2025 Shared Task is a benchmark series for low-resource Bangla NLP, featuring tasks like multi-faceted hate speech detection and Bangla-to-Python code generation with extensive annotated data.
  • Teams utilize transformer ensembles, multitask learning, and adversarial perturbations to tackle challenges such as class imbalance and linguistic nuance in hate speech detection.
  • Evaluation protocols using metrics like Micro-F1 and Pass@1, along with shared leaderboards, promote reproducible research and methodological advances in low-resource NLP.

The BLP 2025 Shared Task series comprises a set of community benchmark challenges designed to advance state-of-the-art methods for low-resource Bangla NLP. The two principal tasks—Bangla Multi-task Hate Speech Identification (Task 1) and Bangla-to-Python Code Generation (Task 2)—were featured at IJCNLP-AACL 2025. These tasks provide large-scale annotated datasets, rigorous evaluation protocols, and leaderboards, catalyzing reproducible research and comparative analysis across teams and diverse modeling paradigms.

1. Task Definitions and Data Annotation

Task 1 focuses on multi-faceted hate speech identification in Bangla, with three subtasks:

  • Subtask 1A: Hate Type Classification—labels: None, Abusive, Sexism, Religious Hate, Political Hate, Profane.
  • Subtask 1B: Target Group Identification—labels: None, Individual, Organization, Community, Society.
  • Subtask 1C: Joint Detection of Type, Severity, and Target—expanding to type (six classes), severity (Little to None, Mild, Severe), and target (four classes).

Annotation covers YouTube comments, with dense multi-label assignment. The official training splits consist of 35,522 samples, with a test set of 10,200 examples. The dataset exhibits extreme class imbalance (e.g., Sexism appears only 122 times in the training split).
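
For orientation, the three label spaces can be written out as plain Python constants (label strings are paraphrased from the task description above, not necessarily the official spellings):

    # Label spaces for the three subtasks (paraphrased from the task description).
    HATE_TYPES = ["None", "Abusive", "Sexism", "Religious Hate", "Political Hate", "Profane"]
    TARGETS_1B = ["None", "Individual", "Organization", "Community", "Society"]
    SEVERITIES = ["Little to None", "Mild", "Severe"]

    # Subtask 1C predicts a (type, severity, target) triple for each comment.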

Task 2 targets Bangla-to-Python code generation: models receive a Bangla natural-language programming instruction and must synthesize a Python function that passes a hidden suite of unit tests. The training split comprises 74 examples (each with a prompt, reference code, and test suite), the development set 400 instances, and the blind test set 500 instances. For some systems, external resources (e.g., Austin et al. 2021) supplement the test suites to promote generalization.
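
As a rough illustration, a hypothetical training record in an MBPP-style layout (field names and content are assumptions, not the official schema of the release):

    # Hypothetical Task 2 record; field names are illustrative, not the official schema.
    example = {
        # "Write a function to check whether a number is prime."
        "instruction": "একটি সংখ্যা মৌলিক কিনা পরীক্ষা করার জন্য একটি ফাংশন লিখুন।",
        "reference_code": (
            "def is_prime(n):\n"
            "    if n < 2:\n"
            "        return False\n"
            "    return all(n % d for d in range(2, int(n ** 0.5) + 1))\n"
        ),
        "test_list": [
            "assert is_prime(2) == True",
            "assert is_prime(9) == False",
        ],
    }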

2. Model Architectures and Methodological Innovations

Task 1: Hate Speech Identification

Teams deployed transformer-based LLMs (BanglaBERT, MuRIL, IndicBERTv2, XLM-RoBERTa, DistilBERT-multilingual) within ensemble and multitask frameworks.

Retriv (Saha et al., 10 Nov 2025):

  • For 1A/1B, fine-tunes each transformer independently; at inference, the predicted softmax probability vectors are combined via soft voting:

P_\text{ensemble}(y|x) = \frac{1}{3}\sum_{m=1}^{3} P_m(y|x)

  • For 1C, each base model serves as a shared encoder with three classification heads (type, severity, target), trained with the multitask loss:

L_\text{mtl} = \alpha L_\text{type} + \beta L_\text{severity} + \gamma L_\text{target}

(the submitted runs weight the three terms equally).

  • Ensemble predictions are combined by weighted voting (a code sketch of both voting schemes follows this block):

P_\text{final}(y|x) = 0.5\,P_\text{MuRIL}(y|x) + 0.3\,P_\text{BanglaBERT}(y|x) + 0.2\,P_\text{IndicBERTv2}(y|x)
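
A minimal sketch of both voting schemes over per-model softmax outputs; the probability values below and the helper names are illustrative placeholders, not the released code:

    import numpy as np

    # probs_by_model: model name -> softmax outputs, each shaped (num_examples, num_classes).
    probs_by_model = {
        "MuRIL": np.array([[0.6, 0.3, 0.1]]),
        "BanglaBERT": np.array([[0.5, 0.4, 0.1]]),
        "IndicBERTv2": np.array([[0.2, 0.7, 0.1]]),
    }

    def soft_vote(probs):
        # Subtasks 1A/1B: unweighted average of the models' probability vectors.
        return np.mean(list(probs.values()), axis=0).argmax(axis=-1)

    def weighted_vote(probs, weights):
        # Subtask 1C: fixed per-model weights (0.5 / 0.3 / 0.2 in the paper).
        return sum(w * probs[name] for name, w in weights.items()).argmax(axis=-1)

    print(soft_vote(probs_by_model))
    print(weighted_vote(probs_by_model, {"MuRIL": 0.5, "BanglaBERT": 0.3, "IndicBERTv2": 0.2}))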

Gradient Masters (Hoque et al., 23 Nov 2025):

  • Utilizes K-fold cross-validated ensembling, hybridizing Bangla-specific and multilingual models.
  • Implements FGSM adversarial perturbation (an embedding-level sketch follows this block):

\Delta = \epsilon\,\text{sign}(\nabla_x J(x, y; \theta))

and the corresponding adversarial training loss:

\tilde{J}(x, y) = \alpha J(x, y; \theta) + (1-\alpha) J(x + \Delta, y; \theta)

  • Lightweight normalization addresses orthographic noise (standardizes transliterated and misspelled tokens).
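
A minimal sketch of FGSM-style adversarial training applied to the input embeddings (a common adaptation of FGSM for text models); it assumes a Hugging Face-style classifier that accepts inputs_embeds and returns a loss, and the eps/alpha values are illustrative, not the team's settings:

    import torch

    def fgsm_mixed_loss(model, input_ids, attention_mask, labels, eps=0.01, alpha=0.5):
        # Clean forward pass on the token embeddings; keep the graph so the clean
        # loss can still be backpropagated after taking its gradient.
        embeds = model.get_input_embeddings()(input_ids)
        clean_loss = model(inputs_embeds=embeds, attention_mask=attention_mask, labels=labels).loss
        grad, = torch.autograd.grad(clean_loss, embeds, retain_graph=True)

        # FGSM perturbation: Delta = eps * sign(grad), added to the embeddings.
        adv_embeds = (embeds + eps * grad.sign()).detach()
        adv_loss = model(inputs_embeds=adv_embeds, attention_mask=attention_mask, labels=labels).loss

        # Mixed objective: alpha * J(x) + (1 - alpha) * J(x + Delta).
        return alpha * clean_loss + (1 - alpha) * adv_loss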

Task 2: Bangla-to-Python Code Generation

Retriv (Asib et al., 10 Nov 2025):

  • Benchmarks six open-weight LLMs; selects Qwen2.5-Coder-14B.
  • Applies LLM-based Bangla→English translation of instruction; explicit test suites included in prompts.
  • Fine-tunes Qwen2.5-Coder-14B via QLoRA (4-bit quantization, low-rank adapters with rank r=128) on the 74-task training set and MBPP; a configuration sketch follows the inference-loop snippet below.
  • Institutes a feedback-guided inference loop: up to three passes, each using an incrementally higher sampling temperature and an error-trace-driven debugging prompt:
    prompt, history = instruction, []
    for pass_i, temperature in enumerate([0.1, 0.3, 0.5], start=1):
        candidate = model.generate(prompt, temp=temperature)
        results = run_tests(candidate, test_list)
        if results.all_passed():
            break
        # On failure, build an error-trace-driven debugging prompt from the
        # failed candidate and retry at the next (higher) temperature.
        prompt = build_debug_prompt(...)
        history.append((candidate, results))
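
The following configuration sketch shows QLoRA fine-tuning as described (4-bit NF4 quantization with rank-128 adapters), using the Hugging Face transformers and peft APIs; the target modules, alpha, dropout, and exact checkpoint name are assumptions, not the team's reported settings:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                       # 4-bit base weights (QLoRA)
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-Coder-14B-Instruct",       # checkpoint name assumed
        quantization_config=bnb,
        device_map="auto",
    )
    lora = LoraConfig(
        r=128, lora_alpha=256, lora_dropout=0.05,    # rank-128 adapters; alpha/dropout assumed
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora)           # only the adapter weights are trainable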

NALA_MAINZ (Saadi et al., 20 Nov 2025):

  • Multi-agent pipeline with a code-generation agent G_\theta and a debugger agent D_\phi (a control-flow sketch follows this list).
  • Stage 1: G_\theta synthesizes candidate code, which is run against the tests; on failure, G_\theta retries once.
  • Stage 2: Candidates that still fail have their error traces extracted and are dispatched to D_\phi, which applies minimal edits to fix the faults.
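
A minimal control-flow sketch of the two-stage pipeline; run_tests and extract_trace are hypothetical helpers, and the agent interfaces are illustrative rather than the team's implementation:

    def two_stage_pipeline(instruction, test_list, generator, debugger, retries=1):
        # Stage 1: the generator agent proposes code; on failure it retries once.
        for _ in range(1 + retries):
            code = generator.generate(instruction)
            results = run_tests(code, test_list)
            if results.all_passed():
                return code

        # Stage 2: extract the error trace from the last failure and hand it to the
        # debugger agent, which applies minimal edits to the failing candidate.
        trace = extract_trace(results)
        return debugger.fix(instruction, code, trace)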

3. Evaluation Protocols and Quantitative Results

Task 1

  • Metric: Micro-F1 (a computation sketch follows this list):

\text{Micro-F1} = \frac{2 \sum_{c} TP_c}{2 \sum_{c} TP_c + \sum_{c} (FP_c + FN_c)}

  • Weighted micro-F1 is used for multiple labels in multitask settings.
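
A minimal computation sketch from per-class counts (for single-label classification this coincides with sklearn's f1_score with average="micro"); the counts below are illustrative:

    def micro_f1(tp, fp, fn):
        # Micro-F1 from per-class true-positive / false-positive / false-negative counts.
        tp_sum, fp_sum, fn_sum = sum(tp), sum(fp), sum(fn)
        return 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)

    # Illustrative counts for three classes:
    print(micro_f1(tp=[50, 30, 20], fp=[10, 5, 5], fn=[8, 6, 6]))  # ≈ 0.833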

Leaderboard positions:

| Team/Method | 1A Micro-F1 | 1B Micro-F1 | 1C Weighted Micro-F1 | Positions (1A/1B/1C) |
|---|---|---|---|---|
| Retriv | 72.75% | 72.69% | 72.62% | 9 / 10 / 7 |
| Gradient Masters | 73.23% | 73.28% | — | 6 / 3 / — |

Baselines:

  • Random: 16.38–23.04%
  • Majority: 56.38–60.72%
  • n-gram: 60.20–63.05%

Dev-set ensemble gains: soft voting yields micro-F1 ~75.72%; multitask weighted voting ~75.12%.

Task 2

  • Metric: Pass@1 — the proportion of generated solutions passing all hidden tests (a computation sketch follows):

\text{Pass@}1 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\,\text{solution}_i\ \text{passes all tests}\,]
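
A minimal sketch, reusing the hypothetical run_tests harness from the snippet above; each entry pairs a generated solution with its task's test list:

    def pass_at_1(solutions_with_tests):
        # Fraction of (solution, test_list) pairs whose solution passes every test.
        passed = sum(run_tests(code, tests).all_passed() for code, tests in solutions_with_tests)
        return passed / len(solutions_with_tests)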

Retriv achieved 93.4% Pass@1 (2nd place); NALA_MAINZ achieved 95.4% Pass@1 (1st place). The latter uses GPT-5 zero-shot with error-trace debugging, gaining +30.8 pp absolute from generator-only to generator+debugger. Ablation studies show that removing the external tests or the debugging module drops performance by up to ~10 pp.

4. Error Analysis and Challenges

Task 1

  • RNN baselines (BiLSTM/BiGRU) trail deep transformer ensembles by ~7–8 pp.
  • Severe class imbalance leads to very low recall for Sexism and Religious Hate.
  • Typical confusion patterns: Abusive↔None, Political Hate↔Profane; Organization→None, Society↔Individual.
  • Multitask settings compound errors (type/target/severity).
  • Errors attributed to model misinterpretation of implicit or indirect hate and group-level attacks.

Task 2

  • Mistranslation and idiomatic drift in Bangla instructions (e.g., "প্রত্যাবদ্ধিতা" misrendered), degrading both initial and debug-stage performance.
  • Feedback loop rarely repairs deep logic errors.
  • Runtime error trace limitations (partial capture of failure modes, especially type mismatches).
  • Resource constraints: low data, small adapter ranks, and single-GPU restrict exploration.

5. Insights, Limitations, and Future Directions

Task 1

  • Transformer ensembles (soft voting) and multitask learning enable complementary linguistic representation, stabilize predictions, and improve recall—especially for frequent classes.
  • Hybrid monolingual/multilingual models transfer cross-lingual context effectively.
  • Computational cost remains high; ensemble weights could benefit from input-adaptive control.
  • Label imbalance is a persistent bottleneck; recommended remedies include data augmentation, resampling, focal loss, and cross-lingual transfer (a focal-loss sketch follows this list).
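
As one illustration of the remedies listed above, a minimal focal-loss sketch in the standard multi-class formulation; the gamma value and optional per-class weights are assumptions, not tuned settings from the papers:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, alpha=None):
        # Down-weights well-classified examples so rare classes (e.g. Sexism)
        # contribute more to the gradient; alpha is an optional per-class weight tensor.
        log_p = F.log_softmax(logits, dim=-1)
        log_pt = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log-prob of the true class
        pt = log_pt.exp()
        loss = -((1.0 - pt) ** gamma) * log_pt
        if alpha is not None:
            loss = loss * alpha[targets]
        return loss.mean()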

Task 2

  • Selective, test-guided debugging—using a dedicated code-fixing agent—is highly effective (+30pp Pass@1 for GPT-5).
  • Explicit inclusion of test cases in prompts aligns the generation process with the evaluation semantics.
  • Scaling adapters, improving translation fidelity (back-translation, post-editing), and symbolic debugging cues are avenues for improvement.
  • Resource extension and crowd-sourced data collection are proposed for richer Bangla code generation benchmarks.

6. Open Source Contributions and Community Impact

Both Retriv and NALA_MAINZ teams have released their training and inference scripts, model checkpoints, and prompt templates (see GitHub URLs in respective papers (Saha et al., 10 Nov 2025, Asib et al., 10 Nov 2025, Saadi et al., 20 Nov 2025)). The public leaderboard structure and extended cross-validated evaluation frameworks greatly facilitate reproducibility for subsequent research in Bangla hate speech and code synthesis tasks. The shared tasks accelerate methodological advances—ensemble architectures, adversarial robustness, multi-stage error-guided inference—and inspire direct extensions to other low-resource NLP domains.


This overview is based entirely on the peer-reviewed findings of the BLP 2025 Shared Task as reported by participating teams and organizers (Saha et al., 10 Nov 2025, Asib et al., 10 Nov 2025, Saadi et al., 20 Nov 2025, Hoque et al., 23 Nov 2025).
