
BGE-M3 Classifier: Ethical Multilingual Filtering

Updated 10 February 2026
  • The paper introduces BGE-M3, a Transformer-based classifier that leverages a multilingual, fairness-aware approach to proactively screen harmful prompts.
  • It utilizes tailored architectural initialization, balanced batching, and Class-Balanced Focal Loss with a dual-language dataset to optimize performance.
  • Empirical evaluation shows an F1-score of approximately 0.81, demonstrating robust cross-lingual detection of harmful content in both English and Vietnamese.

The BGE-M3 classifier is a fine-tuned, multilingual Transformer model for proactive textual prompt screening at the input stage of generative text-to-image pipelines. Developed as the core filtering module within the SafeGen framework, BGE-M3 identifies and blocks prompts likely to elicit harmful, biased, or misleading imagery, with demonstrated robustness across English and Vietnamese. Its performance derives from tailored architectural initialization, a curated dual-language dataset, and a novel fairness-aware optimization scheme that exploits class-balanced sampling in tandem with Class-Balanced Focal Loss. Empirical evaluation establishes that BGE-M3 delivers reliable cross-lingual harmful content detection, ensuring ethical compliance in generative image workflows (Nam et al., 14 Dec 2025).

1. Model Architecture and Initialization

BGE-M3 employs a Transformer encoder backbone with 12 self-attention layers, a hidden size of 768, and 12 attention heads, with approximately 110 million parameters. Initial weights are sourced from a heterogeneous mix of BERT, RoBERTa, and XLM-RoBERTa checkpoints to optimize cross-lingual representation capacity.

Textual inputs are tokenized using a shared Byte Pair Encoding vocabulary (≈50,000 tokens), and embedded as 768-dimensional vectors. Standard positional encodings consistent with the "Attention Is All You Need" framework are applied.

A single-layer classification head, positioned atop the pooled [CLS] token representation, outputs a two-dimensional logit vector. Softmax activation produces class probabilities $p = (p_{\text{safe}}, p_{\text{harmful}})$ for subsequent decision logic.
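The head described above can be sketched as follows. This is an illustrative reconstruction, not the released model: the weights are random, the function name `classify_prompt` is hypothetical, and only the shapes (a 768-dimensional pooled vector mapped to two logits, then softmax) follow the text.

```python
import numpy as np

def classify_prompt(cls_embedding: np.ndarray,
                    W: np.ndarray, b: np.ndarray) -> dict:
    """Single-layer head over the pooled [CLS] vector (768-d) -> 2 logits."""
    logits = cls_embedding @ W + b          # shape (2,)
    exps = np.exp(logits - logits.max())    # numerically stable softmax
    probs = exps / exps.sum()
    return {"p_safe": float(probs[0]), "p_harmful": float(probs[1])}

# Illustrative shapes only; these random weights are not the trained model.
rng = np.random.default_rng(0)
out = classify_prompt(rng.normal(size=768),
                      rng.normal(size=(768, 2)) * 0.01,
                      np.zeros(2))
```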

2. Multilingual Data Curation and Labeling

Training and evaluation utilize a multilingual dataset comprising approximately 830,000 samples with the following composition:

  • Vietnamese fake-news articles and toxic comments (VFND), alongside Vietnamese legal ("clean") documents.
  • English toxic comment and biased news data.

A binary label $y \in \{0,1\}$ is assigned to each example, with $y = 0$ denoting "safe" and $y = 1$ representing "harmful/misleading" prompts. The latter class encompasses requests for biased portrayals, hate speech, disinformation, or prompts likely to generate non-consensual or illicit imagery.

The dataset exhibits a notable class imbalance (730,000 safe vs. 100,000 harmful), a roughly 7:1 ratio.

3. Fairness-Aware Optimization and Training Procedures

To address class imbalance and promote model fairness, two methodological elements are employed:

  • Balanced Batching: Each mini-batch contains equal numbers of safe and harmful instances, normalizing gradient estimates and counteracting domination by the majority class.
  • Class-Balanced Focal Loss: For batch size $N$ and instance $i$, define $y_i$ as the ground-truth label and $p_i$ as the softmax probability for the harmful class. The loss is:

$$\mathcal{L}_{\text{CBF}} = -\frac{1}{N}\sum_{i=1}^{N} \alpha_{y_i}\,(1 - p_i^{y_i})^{\gamma}\log p_i^{y_i}$$

where $\alpha_{y_i}$ is inversely proportional to class frequency, focusing learning on harmful samples, and $\gamma = 2$ is the focusing parameter that de-emphasizes well-classified instances.
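A minimal NumPy sketch of this loss, assuming $\alpha$ is the normalized inverse class frequency (the exact normalization constant is not specified in the text, so the scale here is an assumption; the shape of the loss matches the formula above):

```python
import numpy as np

def cb_focal_loss(probs_harmful, labels, class_counts, gamma=2.0):
    """Class-Balanced Focal Loss over a batch.

    probs_harmful: softmax probability of the 'harmful' class per instance.
    labels: 0 = safe, 1 = harmful.
    class_counts: (n_safe, n_harmful) in the training set; alpha_c ~ 1/count.
    """
    probs_harmful = np.asarray(probs_harmful, dtype=float)
    labels = np.asarray(labels)
    # p_i^{y_i}: probability the model assigns to the true class of instance i
    p_true = np.where(labels == 1, probs_harmful, 1.0 - probs_harmful)
    inv = 1.0 / np.asarray(class_counts, dtype=float)
    alpha = inv / inv.sum()                 # normalized inverse frequency
    a = alpha[labels]
    return float(np.mean(-a * (1.0 - p_true) ** gamma * np.log(p_true)))
```

With a 9:1 count split, a harmful example gets weight 0.9 and a safe one 0.1, while the $(1-p)^\gamma$ factor shrinks the contribution of confidently correct predictions.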

No adversarial or demographic-specific penalty terms were employed. The balanced-batch sampling and focal loss were sufficient to mitigate skew and enhance minority-class sensitivity.
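One way to realize the balanced-batching scheme described above is to oversample the minority (harmful) class with replacement; the paper does not specify the exact sampling mechanism, so this is a plausible sketch, not the authors' implementation:

```python
import random

def balanced_batches(safe_idx, harmful_idx, batch_size, seed=0):
    """Yield mini-batches with equal numbers of safe and harmful indices.

    Harmful examples (the minority class) are drawn with replacement so
    every batch is 50/50 regardless of the corpus-level skew.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    safe = safe_idx[:]
    rng.shuffle(safe)
    for start in range(0, len(safe) - half + 1, half):
        batch = safe[start:start + half] + rng.choices(harmful_idx, k=half)
        rng.shuffle(batch)
        yield batch
```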

4. Inference, Thresholding, and Post-Processing

At inference, BGE-M3 computes $p_{\text{harmful}}$ for the input prompt. The acceptance rule is:

  • If $p_{\text{harmful}} \geq 0.50$, the prompt is rejected with an explanatory message based on the most probable harmful category detected.
  • If $p_{\text{harmful}} < 0.50$, the prompt passes to the generative model (Hyper-SD) for downstream synthesis.

No ensembling or post-calibration was necessary, as validation-stage probabilities exhibited strong interclass separation.
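The acceptance rule above reduces to a single threshold comparison; a minimal sketch, where the function name `route_prompt` and the return format are illustrative rather than taken from the paper:

```python
def route_prompt(p_harmful: float, threshold: float = 0.50) -> dict:
    """Acceptance rule: reject at p_harmful >= threshold, else pass on."""
    if p_harmful >= threshold:
        return {"action": "reject",
                "reason": "prompt flagged as harmful/misleading"}
    # Below threshold: forward to the downstream generator (Hyper-SD).
    return {"action": "generate", "backend": "Hyper-SD"}
```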

5. Evaluation Methodology and Quantitative Performance

Evaluation proceeded on a stratified 10% test split from the full corpus. Standard classification metrics were employed:

$$\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad \text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad \text{F1} = \frac{2\,\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$$

with TP, FP, and FN computed for the harmful class.
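These metrics follow directly from the harmful-class confusion counts; the counts in the usage line below are illustrative, not the paper's:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 for the positive (harmful) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only.
p, r, f1 = prf1(tp=8, fp=2, fn=2)
```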

Results obtained:

  • Accuracy: 0.8215
  • F1-score: 0.8145
  • English subset F1: 0.816
  • Vietnamese subset F1: 0.813

These results indicate strong cross-lingual robustness. Slightly elevated false-negative rates on culturally specific harm categories highlight the current taxonomy's limitations.

6. Comparative Ablation Studies

Ablation and comparative studies isolate the impact of model and training choices:

| Model Configuration | Fine-Tuned | F1 Score |
| --- | --- | --- |
| BGE-M3 (base, no fine-tuning) | No | 0.1840 |
| PhoBERT-base-v2 | Yes | 0.6862 |
| keepitreal/vietnamese-sbert | Yes | 0.1587 |
| xlm-roberta-large | Yes | 0.0932 |

Domain-specific fine-tuning yields the greatest performance delta ($\Delta$F1 $\approx +0.63$ over the pretrained base). Balanced batching and focal weighting each provide an incremental F1 gain of approximately $+0.05$. Changes to backbone architecture have comparatively marginal effect, underscoring the dominant role of curated multilingual data and fairness-aware optimization.

7. Significance and Limitations

BGE-M3 demonstrates the practical viability of lightweight, fairness-aware Transformer-based prompt classifiers for pre-generation ethical content filtering in multilingual generative workflows. It achieves reliable interception of harmful textual inputs (F1 $\approx$ 0.81) without sacrificing language or category robustness.

Unaddressed in this iteration are explicit group-based fairness metrics and comprehensive coverage of culturally-nuanced harmful prompt categories. A plausible implication is that extending label taxonomies and incorporating demographic-aware supervision could further enhance harm detection, particularly for rare or context-dependent categories (Nam et al., 14 Dec 2025).
