PhoGPT: Vietnamese LLM Benchmark
- PhoGPT is an open-source series of Vietnamese generative language models built on the Transformer architecture, serving as a benchmark for Vietnamese NLP tasks.
- It uses a 3.7B-parameter, decoder-only Transformer with optimized long-context attention and a custom byte-level BPE tokenizer tailored to Vietnamese.
- The model excels in conversational agents, document processing, and domain-specific QA, advancing research and practical applications in Vietnamese NLP.
PhoGPT is a series of large, open-source Vietnamese generative language models based on the Transformer architecture, designed and released by VinAI Research in 2023. The series comprises a base pre-trained monolingual model (PhoGPT-4B) and an instruction-following chat variant (PhoGPT-4B-Chat), and has established itself as a benchmark series for Vietnamese language modeling. The models are engineered specifically for Vietnamese, drawing on a large-scale, diverse dataset and on state-of-the-art Transformer implementations and efficient training strategies (Nguyen et al., 2023).
1. Model Architecture and Tokenization
PhoGPT-4B is a decoder-only model that follows the standard Transformer decoder stack, augmented with modern optimizations for large-scale training and inference. The key architectural details include:
- Parameter Count: 3.7 billion (often rounded to "4B" for family designation).
- Depth and Width: 32 Transformer layers, a model (hidden) dimension of 3,072, and multi-head attention with 24 heads.
- Attention Optimizations: Incorporates Triton-based flash attention, enabling memory-efficient, high-throughput attention that scales to long contexts, together with ALiBi (Attention with Linear Biases), which enables robust extrapolation to sequences longer than those seen in training.
- Context Window: Up to 8192 tokens, supporting both extended prompt chaining and long-form document modeling.
- Tokenizer: A byte-level BPE tokenizer tailored for Vietnamese with a vocabulary of 20,480 token types. This choice balances coverage of Vietnamese morphological variation, word segmentation, and semantic preservation by reducing over-segmentation common with smaller vocabulary sizes.
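As a concrete illustration of this design choice, the following minimal sketch trains a byte-level BPE tokenizer with a 20,480-type vocabulary using the Hugging Face "tokenizers" library. The corpus file name and special tokens here are hypothetical; this is not VinAI's actual training script.

```python
# Illustrative sketch, not VinAI's actual script: training a byte-level BPE
# tokenizer with a 20,480-type vocabulary using Hugging Face "tokenizers".
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=20_480,                                     # PhoGPT's reported vocabulary size
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],      # assumed special tokens
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # seed with all 256 byte symbols
)
tokenizer.train(["vi_corpus.txt"], trainer)                # hypothetical corpus file

print(tokenizer.encode("Xin chào Việt Nam").tokens)
```

Because the model operates on bytes before merging, every Vietnamese diacritic sequence remains representable even when a word falls outside the learned merges.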
The model is packaged for straightforward integration with popular frameworks such as "transformers", "vLLM", and "llama.cpp".
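A minimal inference sketch with the "transformers" library follows. The Hugging Face model ID and the "### Câu hỏi / ### Trả lời" prompt template are assumptions based on the project's release conventions and should be verified against the repository.

```python
# Minimal inference sketch with Hugging Face "transformers". The model ID and
# prompt template are assumptions; verify them against the PhoGPT repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vinai/PhoGPT-4B-Chat"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 3.7B model within ~8 GB
    trust_remote_code=True,      # the MPT-style architecture ships custom modeling code
)
model.eval()

# Assumed instruction template: "### Câu hỏi: ...\n### Trả lời:"
prompt = "### Câu hỏi: Viết một đoạn văn ngắn về Hà Nội.\n### Trả lời:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```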
2. Training Data and Pre-training Regimen
PhoGPT-4B is trained from scratch on an extensive and diverse Vietnamese corpus of 102 billion tokens (totaling 482 GB after cleaning and deduplication):
- Text Sources Include:
- 1GB Vietnamese Wikipedia dump (May 2023 snapshot)
- 1.5GB medical texts (various public/clinical sources)
- 3GB literature corpus (public domain books)
- 12GB legal documents (from Vietnamese law databases)
- 40GB news corpus
- 88GB OSCAR-2301 Vietnamese data
- 336GB Vietnamese subset of mC4
Two full epochs are performed over the dataset to ensure sufficient token exposure for rare and complex linguistic phenomena.
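For illustration only (the paper's exact cleaning pipeline is not reproduced here), exact deduplication can be as simple as hashing whitespace-normalized documents and keeping the first occurrence of each hash:

```python
# Toy exact-deduplication pass (illustrative; production pipelines are richer):
# hash whitespace-normalized text and keep the first occurrence of each hash.
import hashlib

def dedup(docs):
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = [
    "Hà Nội là thủ đô của Việt Nam.",
    "Hà Nội  là  thủ đô của Việt Nam.",  # duplicate up to whitespace
    "Huế là cố đô của Việt Nam.",
]
print(len(dedup(corpus)))  # -> 2
```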
Fine-tuning (PhoGPT-4B-Chat):
- Instruction Tuning: 70,000 human-generated prompt-response pairs spanning poetry, essays, corrections, summarization, Q&A, and more.
- Conversational Tuning: 290,000 Vietnamese dialogue samples drawn from Bactrian-X (67K), ChatAlpaca (20K), ShareGPT (40K, filtered to exclude code- and mathematics-focused exchanges), and UltraChat (230K), all either originally in or translated into Vietnamese.
This multi-stage fine-tuning enables enhanced conversational fluency, instruction-following, and context-appropriate response generation for Vietnamese users.
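A hedged sketch of how one instruction pair might be serialized for supervised fine-tuning, again assuming the "### Câu hỏi / ### Trả lời" template from the project repository:

```python
# Hedged sketch: serializing one instruction pair into a training string,
# using the assumed "### Câu hỏi / ### Trả lời" template.
def format_example(instruction: str, response: str) -> str:
    return f"### Câu hỏi: {instruction}\n### Trả lời: {response}"

print(format_example("Sửa lỗi chính tả: 'Tôi thík ăn phở.'", "Tôi thích ăn phở."))
```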
3. Performance Evaluation and Comparative Benchmarks
PhoGPT-4B-Chat is empirically benchmarked against both open-source and commercial models:
- ViTruthfulQA Evaluation: On a 199-question Vietnamese question-answering dataset, the model achieves an overall accuracy of 41.7% (83/199 questions correct), with leading results (43.5%, 64/147) on Vietnam-specific categories. Tasks involve fact verification and contextually specialized knowledge.
- Comparison Set: GPT-4-0125-preview, GPT-3.5-turbo, Gemini Pro 1.0, and multiple open-source Vietnamese models.
- Relative Strengths: PhoGPT-4B-Chat is either competitive with or outperforms commercial models in questions requiring deep understanding of Vietnamese context and culture.
Evaluation centers on accuracy, with task-specific variants as relevant for open-domain QA and cultural-knowledge probes.
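The headline figures above can be verified directly from the reported counts:

```python
# Sanity-checking the reported ViTruthfulQA accuracies from the raw counts.
print(f"overall accuracy:  {83 / 199:.1%}")  # -> 41.7%
print(f"Vietnam-specific:  {64 / 147:.1%}")  # -> 43.5%
```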
4. Applications and Deployment Potential
PhoGPT is designed for a broad set of Vietnamese NLP tasks:
- Conversational Agents: Chatbots for customer service, personal assistants, and task automation.
- Document Processing: Summarization, translation, and information retrieval over long and complex Vietnamese texts, leveraging the 8192-token context window.
- Domain-Specific Question Answering: Applications in legal, medical, and news reporting domains based on the model's exposure to specialized corpora.
- Education and Content Creation: Supporting instructional tools, educational chatbots, spelling/grammar correction, and creative writing.
The model's open-source nature and broad framework compatibility enable integration within both research and production systems, fostering further fine-tuning and adaptation.
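As one deployment path, a quantized GGUF conversion of the model can run on consumer hardware via the llama-cpp-python bindings. The file name below is hypothetical; an official conversion is not assumed here.

```python
# Consumer-hardware inference via llama-cpp-python, assuming a quantized GGUF
# conversion of PhoGPT-4B-Chat exists (the file name below is hypothetical).
from llama_cpp import Llama

llm = Llama(model_path="PhoGPT-4B-Chat-Q4_K_M.gguf", n_ctx=8192)  # full 8192-token window
out = llm("### Câu hỏi: Tóm tắt lịch sử Hà Nội.\n### Trả lời:", max_tokens=200)
print(out["choices"][0]["text"])
```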
5. Technical Contributions and Implementation Strategies
PhoGPT’s architecture and approach reflect several technical advances:
- Efficient Long-Context Attention: The combination of flash attention and ALiBi makes training and inference practical at context sizes previously considered operationally challenging (a minimal ALiBi sketch follows this list).
- Custom Language-Specific Preprocessing: The byte-level BPE tokenizer explicitly addresses Vietnamese diacritics, compound words, and proper segmentation—allowing the vocabulary of 20,480 types to balance granularity and modeling efficiency.
- Scalability: Designed for compatibility with distributed and hardware-accelerated training/inference stacks, supporting deployment ranging from consumer hardware (via quantized “llama.cpp” inference) to high-throughput clusters.
- Fine-tuning at Scale: The multi-source, large-scale instructional and conversational data makes PhoGPT especially responsive in few-shot and instructional scenarios.
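The following self-contained sketch reproduces the standard ALiBi bias computation (Press et al., 2022) for PhoGPT's reported 24-head configuration. It mirrors common open-source implementations rather than PhoGPT's exact code.

```python
# Standard ALiBi bias computation (Press et al., 2022); mirrors common
# open-source implementations, not PhoGPT's exact code.
import math
import torch

def alibi_slopes(n_heads):
    # Geometric per-head slopes. Non-power-of-two head counts (PhoGPT
    # reportedly uses 24) interleave slopes from the two nearest powers of two.
    def pow2_slopes(n):
        start = 2.0 ** (-(2.0 ** -(math.log2(n) - 3)))
        return [start * start**i for i in range(n)]

    if math.log2(n_heads).is_integer():
        return torch.tensor(pow2_slopes(n_heads))
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = pow2_slopes(2 * closest)[0::2][: n_heads - closest]
    return torch.tensor(pow2_slopes(closest) + extra)

def alibi_bias(seq_len, n_heads):
    # Additive attention bias of shape (n_heads, seq_len, seq_len): each head
    # penalizes attention to distant past keys linearly, which is what lets
    # ALiBi extrapolate beyond the training context length.
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()    # rel[i, j] = j - i, <= 0 for past keys
    slopes = alibi_slopes(n_heads)
    bias = slopes[:, None, None] * rel[None, :, :]
    causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    return bias + causal                           # add to attention logits before softmax

print(alibi_bias(seq_len=8, n_heads=24).shape)     # torch.Size([24, 8, 8])
```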
6. Availability and Community Impact
PhoGPT-4B and PhoGPT-4B-Chat are distributed under an open-source license:
- Model Access: All model weights and inference code are available via the project repository: https://github.com/VinAIResearch/PhoGPT
- Ecosystem Integration: Ready compatibility with major LLM deployment frameworks.
- Research Utility: As the leading Vietnamese-specific LLM, PhoGPT serves as a baseline for research in low-resource language modeling, cross-lingual transfer, dialogue systems, and instruction tuning in non-English contexts.
The model’s open-source release is expected to stimulate advances in Vietnamese NLP, resource-efficient modeling for low-resource languages, and broader Southeast Asian NLP infrastructure development.
7. Context and Significance in the LLM Landscape
PhoGPT is positioned as the de facto open Vietnamese LLM series, analogous in scope to Llama for English and multilingual contexts. Its design principles (training from scratch on a curated, massive, language-specific corpus; large-context handling; and modular fine-tuning) reflect broader shifts toward high-performance, culturally relevant LLMs in global NLP research and development.