PhoBERT: Vietnamese Language Model
- PhoBERT is a monolingual Vietnamese language model that leverages Transformer architecture and RoBERTa pretraining for enhanced NLP performance.
- It applies word-level segmentation before fastBPE subword tokenization on a 20 GB corpus, better matching Vietnamese word boundaries across NLP tasks.
- PhoBERT consistently outperforms multilingual baselines on POS tagging, dependency parsing, NER, NLI, and text classification, and serves as a backbone for domain-specific applications.
PhoBERT is a pair of monolingual Vietnamese pre-trained language models based on the Transformer architecture and optimized for a wide range of Vietnamese NLP tasks. Developed by VinAI Research, PhoBERT is the first public large-scale pre-trained BERT-style model tailored specifically to Vietnamese, distinguished by both its corpus construction and its adapted tokenization strategy. Released in base and large configurations, PhoBERT follows the RoBERTa pretraining recipe and consistently outperforms multilingual and prior monolingual baselines on benchmarks for sequence labeling, text classification, and domain-specific applications (Nguyen et al., 2020).
1. Model Architecture and Pretraining Paradigms
PhoBERT is available in two variants, PhoBERT\textsubscript{base} and PhoBERT\textsubscript{large}, mirroring the parameterization of BERT\textsubscript{base} and BERT\textsubscript{large} respectively. Each model is a multi-layer Transformer encoder following the original BERT design but trained with the RoBERTa protocol: the next-sentence prediction objective is omitted, and dynamic masking with longer, more data-intensive training is adopted.
| Variant | Layers | Hidden Size | Self-Attention Heads | Parameters (M) |
|---|---|---|---|---|
| PhoBERT\textsubscript{base} | 12 | 768 | 12 | ~135 |
| PhoBERT\textsubscript{large} | 24 | 1024 | 16 | ~370 |
Input representations are formed from subword token embeddings and positional embeddings; following RoBERTa, no next-sentence segment pairs are used, so segment (token-type) embeddings play no distinguishing role. A key adaptation for Vietnamese is word-level pre-segmentation before subword tokenization (Nguyen et al., 2020).
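As a concrete illustration, the sketch below loads the publicly released checkpoints through HuggingFace Transformers and inspects the configuration fields corresponding to the table above. It assumes the `vinai/phobert-base` and `vinai/phobert-large` model IDs and access to the model hub.

```python
# Sketch: inspect PhoBERT's architecture via HuggingFace Transformers.
# Assumes the published checkpoints "vinai/phobert-base" / "vinai/phobert-large".
from transformers import AutoConfig, AutoModel

for name in ("vinai/phobert-base", "vinai/phobert-large"):
    config = AutoConfig.from_pretrained(name)   # RoBERTa-style configuration
    print(
        name,
        config.num_hidden_layers,       # 12 (base) / 24 (large) encoder layers
        config.hidden_size,             # 768 / 1024
        config.num_attention_heads,     # 12 / 16 self-attention heads
    )

# Loading the encoder itself (weights are downloaded on first use):
model = AutoModel.from_pretrained("vinai/phobert-base")
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```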
2. Pretraining Corpora and Tokenization Strategies
PhoBERT is pretrained on a 20 GB Vietnamese corpus, constructed from approximately 1 GB of Vietnamese Wikipedia text and a deduplicated 19 GB news crawl. All text is first segmented at the word level using VnCoreNLP's RDRSegmenter; subword tokenization is then applied with fastBPE, producing a vocabulary of 64,000 subword types. Sentences in the resulting corpus average 24.4 subword tokens (Nguyen et al., 2020).
A crucial design decision is the explicit word-level segmentation prior to BPE tokenization. This strategy is motivated by the morpho-syllabic structure of Vietnamese and contrasts with models such as XLM-R, where BPE is applied at the syllable level. The explicit segmentation is empirically linked to PhoBERT’s improved word-level performance on downstream Vietnamese NLP tasks (Nguyen et al., 2020).
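A minimal sketch of the two-stage pipeline is shown below. The input is assumed to have already been word-segmented by VnCoreNLP's RDRSegmenter (multi-syllable words joined by underscores, as in the released PhoBERT usage examples); the subword step then uses the BPE vocabulary shipped with the HuggingFace tokenizer.

```python
# Sketch: word-segmented input -> BPE subwords with the PhoBERT tokenizer.
# The sentence is assumed to be pre-segmented (e.g., by VnCoreNLP's
# RDRSegmenter), so multi-syllable words are joined by underscores.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

segmented = "Tôi là sinh_viên trường đại_học ."   # "I am a university student."
encoding = tokenizer(segmented)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # BPE subword pieces
print(tokenizer.vocab_size)                                    # ~64,000 BPE types
```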
3. Pretraining Objectives, Optimization, and Training Regime
PhoBERT’s sole pretraining objective is masked language modeling (MLM), adhering to the RoBERTa protocol. Given an input sequence $x$, its corrupted version $\tilde{x}$, and the set of masked positions $\mathcal{M}$, the objective is to minimize
$$\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in \mathcal{M}} \log P_\theta(x_i \mid \tilde{x}).$$
Training is performed using fairseq’s RoBERTa codebase with the Adam optimizer and a maximum sequence length of 256 subword tokens. PhoBERT\textsubscript{base} is trained with a batch size of 1024 sequences and a peak learning rate of 4e-4 for approximately 540,000 steps over 3 weeks; PhoBERT\textsubscript{large} uses a batch size of 512 and a peak learning rate of 2e-4 for approximately 1,080,000 steps over 5 weeks. Both runs cover 40 epochs with a 2-epoch linear warm-up (Nguyen et al., 2020).
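To make the objective concrete, the sketch below applies RoBERTa-style dynamic masking (15% of subword positions re-sampled on each pass) and computes the MLM cross-entropy only at masked positions. It uses the HuggingFace masked-LM head rather than the original fairseq code, so it illustrates the objective rather than reproducing the training setup.

```python
# Sketch: dynamic masking + MLM loss with PhoBERT (illustrative, not the
# original fairseq training loop). 15% of positions are masked per pass.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")

batch = tokenizer(["Hà_Nội là thủ_đô của Việt_Nam ."], return_tensors="pt")
input_ids = batch["input_ids"].clone()
labels = input_ids.clone()

# Dynamic masking: the mask is re-sampled every time this runs (every epoch in training).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(
        input_ids[0].tolist(), already_has_special_tokens=True
    ),
    dtype=torch.bool,
)
mask = (torch.rand(input_ids.shape) < 0.15) & ~special
mask[0, 1] = True                          # ensure at least one masked position in this tiny example
labels[~mask] = -100                       # loss is computed at masked positions only
input_ids[mask] = tokenizer.mask_token_id  # simplified: always replace with <mask>

loss = model(input_ids=input_ids, attention_mask=batch["attention_mask"], labels=labels).loss
print(float(loss))                         # average -log P(x_i | x~) over masked positions
```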
4. Downstream Benchmark Performance and Comparative Evaluation
PhoBERT establishes new state-of-the-art results across multiple Vietnamese sequence and classification tasks, outperforming both prior monolingual models and large multilingual models such as XLM-R. The table below summarizes key test set results:
| Model | POS Acc. (%) | Dep. Parsing LAS / UAS (%) | NER F₁ (%) | NLI Acc. (%) |
|---|---|---|---|---|
| XLM-R\textsubscript{base} | 96.2 | 76.46 / 83.10 | 92.0 | 75.4 |
| XLM-R\textsubscript{large} | 96.3 | 75.87 / 82.70 | 92.8 | 79.7 |
| PhoBERT\textsubscript{base} | 96.7 | 78.77 / 85.22 | 93.6 | 78.5 |
| PhoBERT\textsubscript{large} | 96.8 | 77.85 / 84.32 | 94.7 | 80.0 |
Notable absolute improvements of PhoBERT\textsubscript{large} over XLM-R\textsubscript{large} are observed: +0.5 percentage points in POS accuracy, +1.98 (LAS) / +1.62 (UAS) in dependency parsing, +1.9 in NER F₁, and +0.3 in NLI accuracy; for dependency parsing the strongest model is PhoBERT\textsubscript{base}, which exceeds XLM-R\textsubscript{large} by +2.90 LAS / +2.52 UAS (Nguyen et al., 2020). These gains persist across 5 random seeds; no significance testing is reported.
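The benchmark numbers above come from task-specific fine-tuning of the pretrained encoder. A hedged sketch of the standard recipe for the sequence-labeling tasks (POS tagging, NER) is given below, using the HuggingFace token-classification head on top of PhoBERT; the tag set, data, and hyperparameters are illustrative placeholders rather than the exact settings of Nguyen et al. (2020).

```python
# Sketch: fine-tuning PhoBERT for a sequence-labeling task (e.g., NER).
# Labels, data, and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

label_names = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # hypothetical tag set
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "vinai/phobert-base", num_labels=len(label_names)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One (word-segmented) training example with word-level tags.
words = ["Nguyễn_Du", "sinh", "tại", "Hà_Tĩnh", "."]
tags  = [1, 0, 0, 3, 0]                                   # B-PER, O, O, B-LOC, O

# Build subword inputs word by word so labels track word boundaries.
input_ids, labels = [tokenizer.bos_token_id], [-100]
for word, tag in zip(words, tags):
    sub_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    input_ids.extend(sub_ids)
    labels.extend([tag] + [-100] * (len(sub_ids) - 1))    # tag only the first subword
input_ids.append(tokenizer.eos_token_id)
labels.append(-100)

input_ids = torch.tensor([input_ids])
labels = torch.tensor([labels])
attention_mask = torch.ones_like(input_ids)

loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
loss.backward()
optimizer.step()
```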
On the SMTCE benchmark for social media text classification, PhoBERT achieves a macro-F1 of 65.44% on VSMEC (Vietnamese emotion recognition), outperforming all multilingual baselines, and is highly competitive with other monolingual BERT derivatives (Nguyen et al., 2022).
5. Extensions, Hybrid Architectures, and Domain-Specific Adaptation
PhoBERT's contextual embeddings have been foundational for further architectural innovations. For Vietnamese token-level classification, the TextGraphFuseGAT architecture integrates PhoBERT with Graph Attention Networks (GATs), constructing a fully connected graph over token embeddings. This integration yields micro-F1 and macro-F1 improvements of up to 4 absolute points over vanilla PhoBERT baselines on benchmarks such as PhoNER-COVID19 and VietMed-NER (Nguyen, 13 Oct 2025). The approach demonstrates that combining PhoBERT's sequential representations with explicit token-level relational modeling substantially enhances performance in domain-specific and noisy-data contexts.
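The sketch below illustrates the general fusion pattern under stated assumptions: PhoBERT token embeddings are treated as nodes of a fully connected graph and refined by a graph attention layer (here `GATConv` from PyTorch Geometric) before token-level classification. It is a schematic reconstruction of the idea, not the published TextGraphFuseGAT implementation.

```python
# Sketch: fusing PhoBERT token embeddings with a graph attention layer over a
# fully connected token graph (schematic, not the published implementation).
import torch
from torch import nn
from torch_geometric.nn import GATConv
from transformers import AutoModel

class PhoBertGatTagger(nn.Module):
    def __init__(self, num_labels: int, gat_dim: int = 256, heads: int = 4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/phobert-base")
        hidden = self.encoder.config.hidden_size            # 768 for the base model
        self.gat = GATConv(hidden, gat_dim, heads=heads)
        self.classifier = nn.Linear(hidden + gat_dim * heads, num_labels)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        logits = []
        for tokens, mask in zip(h, attention_mask.bool()):
            x = tokens[mask]                                 # (n_tokens, hidden)
            n = x.size(0)
            # Fully connected graph over the sentence's tokens (self-edges included).
            src, dst = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
            edge_index = torch.stack([src.reshape(-1), dst.reshape(-1)])
            g = self.gat(x, edge_index)                      # relational refinement
            logits.append(self.classifier(torch.cat([x, g], dim=-1)))
        return logits                                        # per-sentence token logits
```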
In hate speech detection, a PhoBERT-CNN hybrid outperforms prior SOTA and multilingual methods on Vietnamese social media data (e.g., macro-F1 of 67.46% on ViHSD) by augmenting PhoBERT’s outputs with a lightweight convolutional head. Data cleaning, EDA-style augmentation, and Spark-based real-time deployment further establish practical applicability (Tran et al., 2022).
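A minimal sketch of this kind of hybrid is shown below, assuming a bank of Conv1d filters with max-pooling over PhoBERT's token outputs; filter sizes, dimensions, and the classifier are illustrative rather than the exact configuration of Tran et al. (2022).

```python
# Sketch: PhoBERT encoder + lightweight 1D-convolutional classification head
# (illustrative dimensions; not the exact PhoBERT-CNN configuration).
import torch
from torch import nn
from transformers import AutoModel

class PhoBertCnnClassifier(nn.Module):
    def __init__(self, num_classes: int, kernel_sizes=(2, 3, 4), n_filters: int = 128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/phobert-base")
        hidden = self.encoder.config.hidden_size
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, kernel_size=k) for k in kernel_sizes
        )
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), num_classes)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)                                # (batch, hidden, seq_len)
        pooled = [conv(h).relu().max(dim=-1).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=-1))    # class logits
```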
6. Analysis, Limitations, and Directions for Future Work
Layer-wise analyses indicate that PhoBERT\textsubscript{large} underperforms PhoBERT\textsubscript{base} on parsing tasks, plausibly due to diminished syntactic encoding in higher layers, echoing findings from Hewitt & Manning (2019). Further layer-wise probing and analysis are recommended (Nguyen et al., 2020).
Although monolingual PhoBERT surpasses the performance of massively multilingual models trained on much larger corpora (e.g., XLM-R’s 2.5 TB), specialized in-domain pretraining and enhanced objectives (e.g., RTD from ELECTRA) may further boost performance, particularly for informal or social-media datasets (Nguyen et al., 2022).
Future directions also include extending the pretraining corpus, fine-tuning with advanced tokenization schemes to better address the rich morpho-syllabic and informal character of Vietnamese user-generated content, and domain adaptation strategies for medical, social, and legal texts.
7. Impact and Availability
PhoBERT establishes the first large-scale, public, monolingual BERT for Vietnamese, directly enabling advances in sequence labeling, sentiment and hate speech classification, and NER in both general and domain-specific Vietnamese corpora. Its open-source availability in both fairseq and HuggingFace Transformers ensures reproducibility and encourages further experimentation and model refinement (Nguyen et al., 2020). The empirical superiority of PhoBERT over large multilingual models in language-specific applications demonstrates the enduring value of monolingual pretraining in low-resource and morphologically complex languages.