PhoBERT: Vietnamese Language Model
- PhoBERT is a monolingual Vietnamese language model that leverages Transformer architecture and RoBERTa pretraining for enhanced NLP performance.
- It applies word-level segmentation before fastBPE subword tokenization on a 20 GB corpus, better matching Vietnamese word boundaries across NLP tasks.
- PhoBERT consistently outperforms multilingual baselines on POS tagging, dependency parsing, NER, NLI, and text classification, and serves as a backbone for domain-specific applications.
PhoBERT is a pair of monolingual Vietnamese pre-trained language models based on the Transformer architecture and optimized for a wide range of Vietnamese NLP tasks. Developed by VinAI Research, PhoBERT is the first public large-scale pre-trained BERT-style model tailored specifically to Vietnamese, distinguished by both its corpus construction and its adapted tokenization strategy. Released in base and large configurations, PhoBERT follows the RoBERTa pretraining recipe and consistently outperforms multilingual and prior monolingual baselines on benchmarks for sequence labeling, text classification, and domain-specific applications (Nguyen et al., 2020).
1. Model Architecture and Pretraining Paradigms
PhoBERT is available in two variants, PhoBERT\textsubscript{base} and PhoBERT\textsubscript{large}, mirroring the parameterization of BERT\textsubscript{base} and BERT\textsubscript{large} respectively. Each model is a multi-layer Transformer encoder following the original BERT design but trained with the RoBERTa protocol: the next-sentence prediction objective is omitted, and dynamic masking with longer, more data-intensive training is adopted.
| Variant | Layers | Hidden Size | Self-Attention Heads | Parameters (M) |
|---|---|---|---|---|
| PhoBERT\textsubscript{base} | 12 | 768 | 12 | ~135 |
| PhoBERT\textsubscript{large} | 24 | 1024 | 16 | ~370 |
Input representations are formed from subword token embeddings and positional embeddings; following RoBERTa, no next-sentence segment pairs are used, so segment (token-type) embeddings play no distinguishing role. A key adaptation for Vietnamese is word-level pre-segmentation before subword tokenization (Nguyen et al., 2020).
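As a concrete illustration, the sketch below loads the publicly released checkpoints through HuggingFace Transformers and inspects the configuration fields corresponding to the table above. It assumes the `vinai/phobert-base` and `vinai/phobert-large` model IDs and access to the model hub.

```python
# Sketch: inspect PhoBERT's architecture via HuggingFace Transformers.
# Assumes the published checkpoints "vinai/phobert-base" / "vinai/phobert-large".
from transformers import AutoConfig, AutoModel

for name in ("vinai/phobert-base", "vinai/phobert-large"):
    config = AutoConfig.from_pretrained(name)   # RoBERTa-style configuration
    print(
        name,
        config.num_hidden_layers,       # 12 (base) / 24 (large) encoder layers
        config.hidden_size,             # 768 / 1024
        config.num_attention_heads,     # 12 / 16 self-attention heads
    )

# Loading the encoder itself (weights are downloaded on first use):
model = AutoModel.from_pretrained("vinai/phobert-base")
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```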
2. Pretraining Corpora and Tokenization Strategies
PhoBERT is pretrained on a 20 GB Vietnamese corpus, constructed from approximately 1 GB of Vietnamese Wikipedia text and a deduplicated 19 GB news crawl. All text is first segmented at the word level using VnCoreNLP's RDRSegmenter; subword tokenization is then applied with fastBPE, producing a vocabulary of 64,000 subword types. Sentences in the resulting corpus average 24.4 subword tokens (Nguyen et al., 2020).
A crucial design decision is the explicit word-level segmentation prior to BPE tokenization. This strategy is motivated by the morpho-syllabic structure of Vietnamese and contrasts with models such as XLM-R, where BPE is applied at the syllable level. The explicit segmentation is empirically linked to PhoBERT’s improved word-level performance on downstream Vietnamese NLP tasks (Nguyen et al., 2020).
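A minimal sketch of the two-stage pipeline is shown below. The input is assumed to have already been word-segmented by VnCoreNLP's RDRSegmenter (multi-syllable words joined by underscores, as in the released PhoBERT usage examples); the subword step then uses the BPE vocabulary shipped with the HuggingFace tokenizer.

```python
# Sketch: word-segmented input -> BPE subwords with the PhoBERT tokenizer.
# The sentence is assumed to be pre-segmented (e.g., by VnCoreNLP's
# RDRSegmenter), so multi-syllable words are joined by underscores.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

segmented = "Tôi là sinh_viên trường đại_học ."   # "I am a university student."
encoding = tokenizer(segmented)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # BPE subword pieces
print(tokenizer.vocab_size)                                    # ~64,000 BPE types
```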
3. Pretraining Objectives, Optimization, and Training Regime
PhoBERT’s sole pretraining objective is masked language modeling (MLM), adhering to the RoBERTa protocol. Given an input sequence $x$, its corrupted version $\tilde{x}$, and the set of masked positions $\mathcal{M}$, the objective is to minimize
$$\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in \mathcal{M}} \log P_\theta(x_i \mid \tilde{x}).$$
Training is performed using fairseq’s RoBERTa codebase with the Adam optimizer and a maximum sequence length of 256 subword tokens. PhoBERT\textsubscript{base} is trained with a batch size of 1024 sequences and a peak learning rate of 4e-4 for approximately 540,000 steps over 3 weeks; PhoBERT\textsubscript{large} uses a batch size of 512 and a peak learning rate of 2e-4 for approximately 1,080,000 steps over 5 weeks. Both runs cover 40 epochs with a 2-epoch linear warm-up (Nguyen et al., 2020).
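To make the objective concrete, the sketch below applies RoBERTa-style dynamic masking (15% of subword positions re-sampled on each pass) and computes the MLM cross-entropy only at masked positions. It uses the HuggingFace masked-LM head rather than the original fairseq code, so it illustrates the objective rather than reproducing the training setup.

```python
# Sketch: dynamic masking + MLM loss with PhoBERT (illustrative, not the
# original fairseq training loop). 15% of positions are masked per pass.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")

batch = tokenizer(["Hà_Nội là thủ_đô của Việt_Nam ."], return_tensors="pt")
input_ids = batch["input_ids"].clone()
labels = input_ids.clone()

# Dynamic masking: the mask is re-sampled every time this runs (every epoch in training).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(
        input_ids[0].tolist(), already_has_special_tokens=True
    ),
    dtype=torch.bool,
)
mask = (torch.rand(input_ids.shape) < 0.15) & ~special
mask[0, 1] = True                          # ensure at least one masked position in this tiny example
labels[~mask] = -100                       # loss is computed at masked positions only
input_ids[mask] = tokenizer.mask_token_id  # simplified: always replace with <mask>

loss = model(input_ids=input_ids, attention_mask=batch["attention_mask"], labels=labels).loss
print(float(loss))                         # average -log P(x_i | x~) over masked positions
```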
4. Downstream Benchmark Performance and Comparative Evaluation
PhoBERT establishes new state-of-the-art results across multiple Vietnamese sequence and classification tasks, outperforming both prior monolingual models and large multilingual models such as XLM-R. The table below summarizes key test set results:
| Model | POS Acc. (%) | Dep. Parsing LAS / UAS (%) | NER F₁ (%) | NLI Acc. (%) |
|---|---|---|---|---|
| XLM-R\textsubscript{base} | 96.2 | 76.46 / 83.10 | 92.0 | 75.4 |
| XLM-R\textsubscript{large} | 96.3 | 75.87 / 82.70 | 92.8 | 79.7 |
| PhoBERT\textsubscript{base} | 96.7 | 78.77 / 85.22 | 93.6 | 78.5 |
| PhoBERT\textsubscript{large} | 96.8 | 77.85 / 84.32 | 94.7 | 80.0 |
Notable absolute improvements of PhoBERT\textsubscript{large} over XLM-R\textsubscript{large} are observed: +0.5 percentage points in POS accuracy, +1.98 (LAS) / +1.62 (UAS) in dependency parsing, +1.9 in NER F₁, and +0.3 in NLI accuracy; for dependency parsing the strongest model is PhoBERT\textsubscript{base}, which exceeds XLM-R\textsubscript{large} by +2.90 LAS / +2.52 UAS (Nguyen et al., 2020). These gains persist across 5 random seeds; no significance testing is reported.
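The benchmark numbers above come from task-specific fine-tuning of the pretrained encoder. A hedged sketch of the standard recipe for the sequence-labeling tasks (POS tagging, NER) is given below, using the HuggingFace token-classification head on top of PhoBERT; the tag set, data, and hyperparameters are illustrative placeholders rather than the exact settings of Nguyen et al. (2020).

```python
# Sketch: fine-tuning PhoBERT for a sequence-labeling task (e.g., NER).
# Labels, data, and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

label_names = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # hypothetical tag set
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "vinai/phobert-base", num_labels=len(label_names)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One (word-segmented) training example with word-level tags.
words = ["Nguyễn_Du", "sinh", "tại", "Hà_Tĩnh", "."]
tags  = [1, 0, 0, 3, 0]                                   # B-PER, O, O, B-LOC, O

# Build subword inputs word by word so labels track word boundaries.
input_ids, labels = [tokenizer.bos_token_id], [-100]
for word, tag in zip(words, tags):
    sub_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    input_ids.extend(sub_ids)
    labels.extend([tag] + [-100] * (len(sub_ids) - 1))    # tag only the first subword
input_ids.append(tokenizer.eos_token_id)
labels.append(-100)

input_ids = torch.tensor([input_ids])
labels = torch.tensor([labels])
attention_mask = torch.ones_like(input_ids)

loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
loss.backward()
optimizer.step()
```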
On the SMTCE benchmark for social media text classification, PhoBERT achieves a macro-F1 of 65.44% on VSMEC (Vietnamese emotion recognition), outperforming all multilingual baselines, and is highly competitive with other monolingual BERT derivatives (Nguyen et al., 2022).
5. Extensions, Hybrid Architectures, and Domain-Specific Adaptation
PhoBERT's contextual embeddings have been foundational for further architectural innovations. For Vietnamese token-level classification, the TextGraphFuseGAT architecture integrates PhoBERT with Graph Attention Networks (GATs), constructing a fully connected graph over token embeddings. This integration yields micro-F1 and macro-F1 improvements of up to 4 absolute points over vanilla PhoBERT baselines on benchmarks such as PhoNER-COVID19 and VietMed-NER (Nguyen, 13 Oct 2025). The approach demonstrates that combining PhoBERT's sequential representations with explicit token-level relational modeling substantially enhances performance in domain-specific and noisy-data contexts.
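The sketch below illustrates the general fusion pattern under stated assumptions: PhoBERT token embeddings are treated as nodes of a fully connected graph and refined by a graph attention layer (here `GATConv` from PyTorch Geometric) before token-level classification. It is a schematic reconstruction of the idea, not the published TextGraphFuseGAT implementation.

```python
# Sketch: fusing PhoBERT token embeddings with a graph attention layer over a
# fully connected token graph (schematic, not the published implementation).
import torch
from torch import nn
from torch_geometric.nn import GATConv
from transformers import AutoModel

class PhoBertGatTagger(nn.Module):
    def __init__(self, num_labels: int, gat_dim: int = 256, heads: int = 4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/phobert-base")
        hidden = self.encoder.config.hidden_size            # 768 for the base model
        self.gat = GATConv(hidden, gat_dim, heads=heads)
        self.classifier = nn.Linear(hidden + gat_dim * heads, num_labels)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        logits = []
        for tokens, mask in zip(h, attention_mask.bool()):
            x = tokens[mask]                                 # (n_tokens, hidden)
            n = x.size(0)
            # Fully connected graph over the sentence's tokens (self-edges included).
            src, dst = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
            edge_index = torch.stack([src.reshape(-1), dst.reshape(-1)])
            g = self.gat(x, edge_index)                      # relational refinement
            logits.append(self.classifier(torch.cat([x, g], dim=-1)))
        return logits                                        # per-sentence token logits
```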
In hate speech detection, a PhoBERT-CNN hybrid outperforms prior SOTA and multilingual methods on Vietnamese social media data (e.g., macro-F1 of 67.46% on ViHSD) by augmenting PhoBERT’s outputs with a lightweight convolutional head. Data cleaning, EDA-style augmentation, and Spark-based real-time deployment further establish practical applicability (Tran et al., 2022).
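A minimal sketch of this kind of hybrid is shown below, assuming a bank of Conv1d filters with max-pooling over PhoBERT's token outputs; filter sizes, dimensions, and the classifier are illustrative rather than the exact configuration of Tran et al. (2022).

```python
# Sketch: PhoBERT encoder + lightweight 1D-convolutional classification head
# (illustrative dimensions; not the exact PhoBERT-CNN configuration).
import torch
from torch import nn
from transformers import AutoModel

class PhoBertCnnClassifier(nn.Module):
    def __init__(self, num_classes: int, kernel_sizes=(2, 3, 4), n_filters: int = 128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/phobert-base")
        hidden = self.encoder.config.hidden_size
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, kernel_size=k) for k in kernel_sizes
        )
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), num_classes)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)                                # (batch, hidden, seq_len)
        pooled = [conv(h).relu().max(dim=-1).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=-1))    # class logits
```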
6. Analysis, Limitations, and Directions for Future Work
Layer-wise analyses indicate that PhoBERT\textsubscript{large} underperforms PhoBERT\textsubscript{base} on parsing tasks, plausibly due to diminished syntactic encoding in higher layers, echoing findings from Hewitt & Manning (2019). Further layer-wise probing and analysis are recommended (Nguyen et al., 2020).
Although monolingual PhoBERT surpasses the performance of massively multilingual models trained on much larger corpora (e.g., XLM-R’s 2.5 TB), specialized in-domain pretraining and enhanced objectives (e.g., RTD from ELECTRA) may further boost performance, particularly for informal or social-media datasets (Nguyen et al., 2022).
Future directions also include extending the pretraining corpus, fine-tuning with advanced tokenization schemes to better address the rich morpho-syllabic and informal character of Vietnamese user-generated content, and domain adaptation strategies for medical, social, and legal texts.
7. Impact and Availability
PhoBERT establishes the first large-scale, public, monolingual BERT for Vietnamese, directly enabling advances in sequence labeling, sentiment and hate speech classification, and NER in both general and domain-specific Vietnamese corpora. Its open-source availability in both fairseq and HuggingFace Transformers ensures reproducibility and encourages further experimentation and model refinement (Nguyen et al., 2020). The empirical superiority of PhoBERT over large multilingual models in language-specific applications demonstrates the enduring value of monolingual pretraining in low-resource and morphologically complex languages.