PhoBERT-CNN Hybrid
- The paper introduces a hybrid model that integrates PhoBERT with a multi-channel Text-CNN to effectively capture both contextual and localized lexical features for Vietnamese hate speech detection.
- It employs a rigorous two-phase data cleaning and Easy Data Augmentation strategy to address orthographic noise and class imbalance, significantly boosting macro-F1 scores.
- Empirical evaluations on ViHSD and HSD-VLSP benchmarks, along with real-time streaming deployment, demonstrate its superior performance and practical applicability in content moderation.
The PhoBERT-CNN Hybrid is a composite neural architecture that leverages the monolingual Vietnamese transformer model PhoBERT as an embedding encoder, paired with a multi-channel convolutional (Text-CNN) classification head. This approach is designed for high-accuracy hate speech detection (HSD) in the Vietnamese language, addressing distinctive challenges found in social media text such as orthographic noise, morphological complexity, and severe class imbalance. The architecture achieves state-of-the-art macro-averaged F1-scores on leading Vietnamese HSD benchmarks and is deployable in real-time streaming pipelines (Tran et al., 2022).
1. Hybrid Model Architecture
The PhoBERT-CNN pipeline is structured as follows: given a pre-processed Vietnamese token sequence $X = (x_1, x_2, \dots, x_n)$, the PhoBERT transformer generates contextualized word embeddings $E = (e_1, e_2, \dots, e_n) \in \mathbb{R}^{n \times d}$, where $d$ is the hidden size (e.g., $d = 768$ for PhoBERT-base). These embeddings are passed to parallel 1D convolution channels (with different kernel sizes, each with 32 filters). For a channel with kernel size $k$, filters $W \in \mathbb{R}^{k \times d}$ and biases $b$ produce feature maps via
$$c_i = \mathrm{ReLU}\!\left(W \cdot E_{i:i+k-1} + b\right), \qquad i = 1, \dots, n-k+1.$$
Max-pooling over time yields one pooled feature per filter, $\hat{c} = \max_i c_i$; the pooled vectors are concatenated across channels as $z = [\hat{c}^{(1)}; \hat{c}^{(2)}; \dots]$, then mapped by a fully connected softmax layer to produce the label distribution $\hat{y} = \mathrm{softmax}(W_o z + b_o)$ over $K = 3$ classes (CLEAN, OFFENSIVE, HATE). The loss is cross-entropy,
$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log \hat{y}_k.$$
This architecture exploits PhoBERT’s monolingual vocabulary and transformer context encoding, while the CNN layer distills localized lexical features relevant to HSD (Tran et al., 2022).
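The PyTorch sketch below illustrates this hybrid: a PhoBERT encoder feeding parallel `Conv1d` channels with max-over-time pooling and a linear softmax head. The `vinai/phobert-base` checkpoint name is the standard Hugging Face identifier, but the kernel sizes shown are illustrative assumptions rather than the exact configuration reported by Tran et al. (2022).

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class PhoBertCNN(nn.Module):
    """Sketch of a PhoBERT encoder with a multi-channel Text-CNN head."""

    def __init__(self, kernel_sizes=(2, 3, 4), n_filters=32, n_classes=3,
                 dropout=0.4, phobert_name="vinai/phobert-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(phobert_name)
        hidden = self.encoder.config.hidden_size  # 768 for PhoBERT-base
        # One 1D convolution channel per kernel size, 32 filters each.
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, input_ids, attention_mask):
        # Contextual embeddings: (batch, seq_len, hidden)
        emb = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        emb = emb.transpose(1, 2)                        # (batch, hidden, seq_len)
        pooled = [torch.relu(conv(emb)).max(dim=2).values  # max over time
                  for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.classifier(features)                 # logits: CLEAN/OFFENSIVE/HATE
```

Training would minimize `nn.CrossEntropyLoss` on the returned logits, matching the cross-entropy objective above.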
2. Data Pre-processing and Augmentation
A two-stage cleaning pipeline is employed to address the orthographic and lexical noise prevalent in Vietnamese social media:
- Phase 1 (character-level): Lowercasing, redundant whitespace/hyperlink removal, Unicode normalization (to UTF-8 NFC), deletion of repeated characters, and normalization of Vietnamese diacritic placement.
- Phase 2 (token-level): Vietnamese word segmentation (VnCoreNLP for PhoBERT-based models), mapping of teencode (slang/abbreviation) to standard forms, and stopword removal using a published Vietnamese stopword list (a minimal sketch of both phases follows this list).
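A minimal Python sketch of the two cleaning phases, using illustrative regular expressions; `TEENCODE_MAP` and `STOPWORDS` are hypothetical stand-ins for the published teencode and stopword resources, and VnCoreNLP word segmentation is assumed to run separately before Phase 2.

```python
import re
import unicodedata

# Hypothetical teencode/slang dictionary and stopword list; the actual
# resources used by Tran et al. (2022) are external published lists.
TEENCODE_MAP = {"ko": "không", "dc": "được"}
STOPWORDS = {"là", "của"}

def clean_phase1(text: str) -> str:
    """Character-level cleaning: lowercase, strip hyperlinks and extra
    whitespace, NFC-normalize Unicode, and collapse repeated characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)       # remove hyperlinks
    text = unicodedata.normalize("NFC", text)        # canonical diacritic form
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # "đẹpppp" -> "đẹp"
    return re.sub(r"\s+", " ", text).strip()

def clean_phase2(tokens: list[str]) -> list[str]:
    """Token-level cleaning: map teencode to standard forms, drop stopwords.
    Word segmentation (e.g., VnCoreNLP) is assumed to have run already."""
    tokens = [TEENCODE_MAP.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOPWORDS]
```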
To address class imbalance (e.g., CLEAN ≈ 83%, OFFENSIVE ≈ 7%, HATE ≈ 10% in ViHSD), Easy Data Augmentation (EDA) techniques (Wei & Zou, 2019) are applied to minority-class examples: synonym substitution, random insertion, random swap, and random deletion, each at a fixed replacement ratio. Augmentation is performed only for the minority classes to balance the training set, and all models are subsequently retrained on this balanced, enriched corpus (Tran et al., 2022).
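The EDA operations themselves are lightweight token-level perturbations. The sketch below illustrates them under the assumption of a small, hypothetical Vietnamese synonym lookup; the actual synonym resource and replacement ratio used in the paper are not reproduced here.

```python
import random

SYNONYMS = {"xấu": ["tệ", "dở"]}  # hypothetical Vietnamese synonym lookup

def eda_augment(tokens, alpha=0.1):
    """Apply one randomly chosen EDA operation (Wei & Zou, 2019) to a
    minority-class example; `alpha` controls how many tokens are touched."""
    tokens = list(tokens)
    if not tokens:
        return tokens
    n_changes = max(1, int(alpha * len(tokens)))
    op = random.choice(["synonym", "insert", "swap", "delete"])
    for _ in range(n_changes):
        if op == "synonym":
            idx = random.randrange(len(tokens))
            tokens[idx] = random.choice(SYNONYMS.get(tokens[idx], [tokens[idx]]))
        elif op == "insert":
            word = random.choice(tokens)
            tokens.insert(random.randrange(len(tokens) + 1),
                          random.choice(SYNONYMS.get(word, [word])))
        elif op == "swap":
            i, j = random.randrange(len(tokens)), random.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]
        elif op == "delete" and len(tokens) > 1:
            tokens.pop(random.randrange(len(tokens)))
    return tokens
```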
3. Training Setup and Hyperparameters
PhoBERT-CNN employs:
- Batch size: $64$
- Learning rate:
- Optimizer: AdamW (linear warmup)
- Weight decay: $0.01$
- Epochs: $3$ (ViHSD); 5-fold cross-validation (HSD-VLSP)
- Dropout (CNN layer): $0.4$
- Convolutional channels: parallel channels with different kernel sizes, 32 filters each
- No additional class weighting is used due to prior data augmentation
Baseline models include Naive Bayes (NB), SVM, and Random Forest (RF) classifiers on TF-IDF features with hyperparameter tuning, plus deep networks (Text-CNN, Bi-LSTM with fastText/PhoW2V initialization). Non-transformer architectures are trained for up to 10 epochs with batch size $64$, dropout $0.4$, and early stopping (Tran et al., 2022).
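A sketch of the fine-tuning setup implied by these hyperparameters, using the standard `transformers` linear-warmup scheduler; the learning rate and warmup fraction shown are placeholder assumptions, not values reported in the paper.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, train_loader, epochs=3, lr=2e-5,
                    weight_decay=0.01, warmup_fraction=0.1):
    """AdamW with linear warmup/decay; lr and warmup_fraction are placeholders."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_fraction * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```

Batch size $64$ and dropout $0.4$ from the list above would be applied in the DataLoader and the CNN head, respectively.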
4. Empirical Evaluation
The PhoBERT-CNN model is benchmarked on ViHSD (33,400 comments, 7:1:2 split) and HSD-VLSP (20,345 comments). Both show severe class imbalance. Evaluation metrics are macro-F1 and accuracy. Results:
| Model | ViHSD macro-F1 (%) | ViHSD Acc. (%) | HSD-VLSP macro-F1 (%) | HSD-VLSP Acc. (%) |
|---|---|---|---|---|
| Text-CNN+fastText | 61.67 | 86.98 | 85.76 | 97.14 |
| Bi-LSTM+PhoW2V | 62.66 | 85.99 | 84.04 | 96.79 |
| BERT-large | 60.29 | 84.52 | 85.41 | 96.19 |
| RoBERTa-large | 61.49 | 83.04 | 85.79 | 96.95 |
| XLM-R-large | 62.38 | 83.62 | 86.57 | 97.15 |
| PhoBERT-large | 63.51 | 87.13 | 86.68 | 97.58 |
| BERT-CNN | 61.26 | 85.90 | 86.37 | 96.17 |
| RoBERTa-CNN | 62.47 | 84.54 | 86.48 | 96.38 |
| XLM-R-CNN | 63.34 | 85.48 | 88.53 | 96.92 |
| PhoBERT-CNN | 67.46 | 87.76 | 98.45 | 98.59 |
Ablation confirms PhoBERT yields ≈3% macro-F1 gain over multilingual baselines, the CNN layer further adds 1–4%, EDA augmentation provides 3–8%, and two-phase cleaning yields 4–7% improvement relative to simpler normalizations. Error analysis indicates most misclassifications occur on ambiguous or sarcastic content (Tran et al., 2022).
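Because macro-F1 averages per-class F1 scores with equal weight, minority-class (OFFENSIVE, HATE) performance is not swamped by the CLEAN majority. A minimal scikit-learn sketch with made-up labels shows how both reported metrics are computed:

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels only: 0 = CLEAN, 1 = OFFENSIVE, 2 = HATE
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 0, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("macro-F1 :", f1_score(y_true, y_pred, average="macro"))
```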
5. Real-Time Streaming Application
PhoBERT-CNN is operationalized within an Apache Spark Structured Streaming pipeline (Spark 3.1.1). The system ingests live YouTube comments via TCP, applies the two-phase cleaning, processes text through the PhoBERT-CNN model, and writes predictions to Parquet storage for monitoring. A web interface (SparkSQL + REST) enables real-time visualization. On hardware comprising an Intel Core i7 CPU, 16 GB RAM, and an NVIDIA GTX 1650 Ti GPU, the pipeline achieves a throughput of 0.64 comments/sec (1.56 s per comment), with accuracy of 82.02% and macro-F1 of 58.19% on 500 annotated streaming samples. Inference latency is dominated by PhoBERT processing; reducing this bottleneck requires model distillation or batching (Tran et al., 2022).
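A minimal PySpark Structured Streaming sketch of this flow (TCP socket source → cleaning/classification UDF → Parquet sink); the host, port, output paths, and the `classify_comment` placeholder are hypothetical stand-ins for the deployed components.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("phobert-cnn-hsd").getOrCreate()

def classify_comment(text):
    """Placeholder: run two-phase cleaning, then PhoBERT-CNN inference,
    returning one of CLEAN / OFFENSIVE / HATE."""
    return "CLEAN"  # replace with real preprocessing + model call

classify_udf = udf(classify_comment, StringType())

# Ingest live comments over TCP (hypothetical host/port).
comments = (spark.readStream
            .format("socket")
            .option("host", "localhost")
            .option("port", 9999)
            .load())  # streaming DataFrame with a single 'value' column

predictions = comments.withColumn("label", classify_udf(comments["value"]))

# Persist predictions to Parquet for downstream monitoring/visualization.
query = (predictions.writeStream
         .format("parquet")
         .option("path", "/tmp/hsd_predictions")
         .option("checkpointLocation", "/tmp/hsd_checkpoint")
         .start())
query.awaitTermination()
```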
6. Strengths, Limitations, and Future Work
The model's monolingual transformer encoder captures Vietnamese morphological and idiomatic nuances more faithfully than multilingual alternatives. The CNN classification head efficiently extracts local lexical patterns, and the strict two-phase cleaning pipeline combined with minority-only data augmentation mitigates class imbalance without costly additional annotation.
Several limitations remain. PhoBERT-CNN's error profile is dominated by failures on sarcastic or highly context-dependent speech, and the architecture does not explicitly model discourse context. Latency (~1.5 s per comment) may be excessive for time-sensitive applications, motivating exploration of distilled models (e.g., DistilPhoBERT). Explainability is limited, with potential for incorporating attention-based rationale-extraction frameworks such as HateXplain. Extension to span-level detection and multi-aspect annotation (e.g., target specificity, severity) remains an open area. Finally, additional streaming benchmarks, including high-throughput and fault-tolerance tests, would inform scalability for production deployment (Tran et al., 2022).
In summary, the PhoBERT-CNN hybrid, augmented by rigorous Vietnamese-specific pre-processing and targeted data augmentation, establishes new state-of-the-art performance for Vietnamese hate speech detection, and demonstrates feasibility for real-time content moderation while highlighting avenues for improved efficiency and interpretability (Tran et al., 2022).