BGE-M3 Classifier: Ethical Multilingual Filtering

Updated 10 February 2026

The paper introduces BGE-M3, a Transformer-based classifier that leverages a multilingual, fairness-aware approach to proactively screen harmful prompts.
It utilizes tailored architectural initialization, balanced batching, and Class-Balanced Focal Loss with a dual-language dataset to optimize performance.
Empirical evaluation shows an F1-score of approximately 0.81, demonstrating robust cross-lingual detection of harmful content in both English and Vietnamese.

The BGE-M3 classifier is a fine-tuned, multilingual Transformer model for proactive textual prompt screening at the input stage of generative text-to-image pipelines. Developed as the core filtering module within the SafeGen framework, BGE-M3 identifies and blocks prompts likely to elicit harmful, biased, or misleading imagery, with demonstrated robustness across English and Vietnamese. Its performance derives from tailored architectural initialization, a curated dual-language dataset, and a novel fairness-aware optimization scheme that exploits class-balanced sampling in tandem with Class-Balanced Focal Loss. Empirical evaluation establishes that BGE-M3 delivers reliable cross-lingual harmful content detection, ensuring ethical compliance in generative image workflows (Nam et al., 14 Dec 2025).

1. Model Architecture and Initialization

BGE-M3 employs a Transformer encoder backbone with 12 self-attention layers, a hidden size of 768, and 12 attention heads, with approximately 110 million parameters. Initial weights are sourced from a heterogeneous mix of BERT, RoBERTa, and XLM-RoBERTa checkpoints to optimize cross-lingual representation capacity.

Textual inputs are tokenized using a shared Byte Pair Encoding vocabulary (≈50,000 tokens), and embedded as 768-dimensional vectors. Standard positional encodings consistent with the "Attention Is All You Need" framework are applied.

A single-layer classification head, positioned atop the pooled [CLS] token representation, outputs a two-dimensional logit vector. Softmax activation produces class probabilities $p = (p_{\text{safe}}, p_{\text{harmful}})$ for subsequent decision logic.

2. Multilingual Data Curation and Labeling

Training and evaluation utilize a multilingual dataset comprising approximately 830,000 samples with the following composition:

Vietnamese fake-news articles and toxic comments (VFND), alongside Vietnamese legal ("clean") documents.
English toxic comment and biased news data.

A binary label, $y \in \{0,1\}$ , is assigned to each example, with $y=0$ denoting "safe" and $y=1$ representing "harmful/misleading" prompts. The latter class encompasses requests for biased portrayals, hate speech, disinformation, or prompts likely to generate non-consensual or illicit imagery.

The dataset exhibits a notable class imbalance (730,000 safe vs. 100,000 harmful), yielding a 9:1 ratio.

3. Fairness-Aware Optimization and Training Procedures

To address class imbalance and promote model fairness, two methodological elements are employed:

Balanced Batching: Each mini-batch contains equal numbers of safe and harmful instances, normalizing gradient estimates and counteracting domination by the majority class.
Class-Balanced Focal Loss: For batch size $N$ and instance $i$ , define $y_i$ as the ground-truth label and $p_i$ as the softmax probability for the harmful class. The loss is:

$\mathcal{L}_{\text{CBF}} = -\frac{1}{N}\sum_{i=1}^N \alpha_{y_i}(1 - p_i^{y_i})^\gamma\log p_i^{y_i}$

where $\alpha_{y_i}$ is inversely proportional to class frequency, focusing learning on harmful samples, and $\gamma=2$ is the focusing parameter to de-emphasize well-classified instances.

No adversarial or demographic-specific penalty terms were employed. The balanced-batch sampling and focal loss were sufficient to mitigate skew and enhance minority-class sensitivity.

4. Inference, Thresholding, and Post-Processing

At inference, BGE-M3 computes $p_{\text{harmful}}$ for the input prompt. The acceptance rule is:

If $p_{\text{harmful}} \geq 0.50$ , the prompt is rejected and accompanied by an explanatory return based on the most probable detected harmful category.
If $p_{\text{harmful}} < 0.50$ , the prompt passes to the generative model (Hyper-SD) for downstream synthesis.

No ensembling or post-calibration was necessary, as validation-stage probabilities exhibited strong interclass separation.

5. Evaluation Methodology and Quantitative Performance

Evaluation proceeded on a stratified 10% test split from the full corpus. Standard classification metrics were employed:

$\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \ \text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \ \text{F1-score} = 2\frac{\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$

with TP, FP, and FN computed for the harmful class.

Results obtained:

Accuracy: 0.8215
F1-score: 0.8145
English subset F1: 0.816
Vietnamese subset F1: 0.813

These results indicate strong cross-lingual robustness. Slightly elevated false negative rates on culturally-specific harm categories highlight the current taxonomy's limitations.

6. Comparative Ablation Studies

Ablation and comparative studies isolate the impact of model and training choices:

Model Configuration	Fine-Tuned	F1 Score
BGE-M3 (base, no fine-tuning)	No	0.1840
PhoBERT-base-v2 (fine-tuned)	Yes	0.6862
keepitreal/vietnamese-sbert (fine-tuned)	Yes	0.1587
xlm-roberta-large (fine-tuned)	Yes	0.0932

Domain-specific fine-tuning yields the greatest performance delta ( $\Delta$ F1 $\approx+0.63$ over the pretrained base). Balanced batching and focal weighting each provide an incremental F1 gain of approximately $+0.05$ . Changes to backbone architecture have comparatively marginal effect, underscoring the dominant role of curated multilingual data and fairness-aware optimization.

7. Significance and Limitations

BGE-M3 demonstrates the practical viability of lightweight, fairness-aware Transformer-based prompt classifiers for pre-generation ethical content filtering in multilingual generative workflows. It achieves reliable interception of harmful textual inputs (F1 $\approx$ 0.81) without sacrificing language or category robustness.

Unaddressed in this iteration are explicit group-based fairness metrics and comprehensive coverage of culturally-nuanced harmful prompt categories. A plausible implication is that extending label taxonomies and incorporating demographic-aware supervision could further enhance harm detection, particularly for rare or context-dependent categories (Nam et al., 14 Dec 2025).

Markdown Report Issue Upgrade to Chat

References (1)

SafeGen: Embedding Ethical Safeguards in Text-to-Image Generation (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to BGE-M3 Classifier.