
PhoGPT-4B-Chat: Vietnamese Conversational AI

Updated 21 July 2025
  • PhoGPT-4B-Chat is a transformer-based Vietnamese conversational model offering robust dialogue capabilities through extensive pre-training and fine-tuning.
  • It leverages an 8,192-token context window with flash attention and ALiBi for efficient long-sequence processing and coherent, context-rich responses.
  • Evaluations on local Vietnamese benchmarks demonstrate that it outperforms competing models, setting a new standard for language-specific AI performance.

PhoGPT-4B-Chat is an open-source, large-scale Vietnamese conversational LLM and a central member of the PhoGPT model suite, developed to provide high-quality Vietnamese natural language understanding and generation through targeted generative pre-training and conversational fine-tuning. Built on a transformer decoder architecture, it is designed for robust, instruction-following dialogue in Vietnamese: the base model is pre-trained on a massive monolingual corpus and then refined through supervised fine-tuning on a curated corpus of instructional and conversational data. Current systematic evaluations rank PhoGPT-4B-Chat as the highest-performing open Vietnamese conversational LLM, establishing a new benchmark for Vietnamese language AI systems (Nguyen et al., 2023).

1. Model Architecture and Pre-training

PhoGPT-4B-Chat is built upon a transformer decoder architecture following modern design principles for autoregressive LLMs. The base model, PhoGPT-4B, has approximately 3.7 billion parameters (rounded to "4B" in the name), structured into 32 layers, each with 24 attention heads and a hidden size of 3,072.
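The stated dimensions are consistent with the reported parameter count. A back-of-the-envelope check, using the standard approximation of roughly 12·d² weights per decoder layer (attention projections plus a 4× feed-forward expansion) and ignoring biases and layer norms:

```python
# Sanity-check the "3.7B parameters" figure from the architecture above.
# Approximation only: ~12 * d_model^2 per layer (attention QKV + output
# projections ~4*d^2, MLP with 4x expansion ~8*d^2), plus the token
# embedding matrix. Exact totals depend on biases, norms, and embedding tying.

n_layers = 32
d_model = 3072
vocab_size = 20480

per_layer = 12 * d_model ** 2          # attention + feed-forward weights
embeddings = vocab_size * d_model      # token embedding matrix
total = n_layers * per_layer + embeddings

print(f"{total / 1e9:.2f}B parameters")  # ~3.69B, close to the reported 3.7B
```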

A core feature is its 8,192-token context window, optimized for long-sequence processing using advanced methods such as flash attention (Triton implementation) for computational efficiency and reduced memory consumption. The model uses a Vietnamese-specific byte-level Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 20,480, providing granular coverage of Vietnamese orthography and reducing the risk of tokenization-induced artifacts.
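The byte-level design matters for a diacritic-rich script: every Vietnamese string decomposes into UTF-8 bytes, so no character is ever out of vocabulary, and the learned BPE merges then recombine frequent byte sequences into subwords. A minimal illustration (the actual 20,480-entry merge table is learned from PhoGPT's corpus, not shown here):

```python
# Byte-level tokenization guarantees coverage of all Vietnamese text:
# every codepoint maps to 1-4 UTF-8 bytes, so there are no out-of-vocabulary
# characters. Diacritic-rich Vietnamese codepoints expand to multiple bytes,
# which learned BPE merges later recombine into frequent subwords.
# (Illustration only; not PhoGPT's actual tokenizer.)

text = "Xin chào Việt Nam"
byte_ids = list(text.encode("utf-8"))

print(len(text), "characters ->", len(byte_ids), "bytes")
# 'ệ' (U+1EC7) alone occupies three bytes before any merges are applied
print(list("ệ".encode("utf-8")))
```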

Pre-training was conducted from scratch on a deduplicated corpus comprising approximately 102 billion Vietnamese tokens, drawing on a diverse set of domains—Wikipedia, books, legal documents, news, medical texts, and large-scale web-crawled sources (notably variants of OSCAR and mC4). Training ran for two epochs over this corpus; the second pass deepens linguistic coverage, while the prior deduplication limits the overfitting risk that repeated data would otherwise introduce (Nguyen et al., 2023).
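Deduplication at this scale is typically done by hashing a normalized form of each document. A minimal exact-match sketch (the summary above does not specify PhoGPT's actual pipeline, so this is illustrative only):

```python
import hashlib

# Minimal exact-deduplication pass of the kind used when assembling large
# pre-training corpora. Sketch only; the actual PhoGPT pipeline is not
# described at this level of detail.

def deduplicate(docs):
    seen = set()
    unique = []
    for doc in docs:
        # hash a whitespace-normalized form so trivial spacing differences collapse
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = [
    "Hà Nội là thủ đô của Việt Nam.",
    "Hà Nội  là thủ đô của Việt Nam.",   # duplicate up to whitespace
    "Phở là một món ăn truyền thống.",
]
print(len(deduplicate(corpus)))  # 2
```

Production pipelines usually add near-duplicate detection (e.g. MinHash over shingles) on top of exact hashing, since web crawls contain many lightly edited copies.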

2. Supervised Fine-Tuning: Instructional and Conversational Data

PhoGPT-4B-Chat is derived from PhoGPT-4B via extensive supervised fine-tuning. The fine-tuning dataset is composed of 70,000 instruction–response pairs and an additional 290,000 conversational exchanges, including but not limited to:

  • Direct instructional tasks (such as poem writing, summarization, and context-based Q&A)
  • Vietnamese adaptations of established datasets: Bactrian-X, ChatAlpaca, ShareGPT, and UltraChat

This combination ensures domain diversity and improved generalization to a wide range of dialogue tasks. The prompts were selected to represent a broad range of practical conversational situations, and are specifically tailored for the Vietnamese context. The fine-tuning process employs supervised learning, with feedback from human annotators to further enhance factuality and engagement.
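For supervised fine-tuning, each instruction–response pair is typically serialized into a single training string, with the loss masked so only the response tokens contribute. A sketch of such a formatter; the literal Vietnamese template below ("### Câu hỏi:" / "### Trả lời:") is an assumption for illustration, so consult the official PhoGPT release for the exact format:

```python
# Sketch of serializing an instruction-response pair for SFT. The template
# string is assumed for illustration, not taken from the PhoGPT paper.

PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"

def build_sft_example(instruction: str, response: str) -> str:
    """Concatenate prompt and target; during fine-tuning the loss is
    usually computed only over the response tokens."""
    return PROMPT_TEMPLATE.format(instruction=instruction) + " " + response

example = build_sft_example(
    "Viết một bài thơ ngắn về Hà Nội.",   # "Write a short poem about Hanoi."
    "Hà Nội mùa thu, lá vàng rơi...",
)
print(example)
```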

3. Evaluation and Comparative Performance

The model’s evaluation centers on the ViTruthfulQA dataset, which contains 199 questions (147 of which are Vietnam-specific). Metrics are based on human annotations: a response is labeled "correct" only if it is accurate and non-hallucinatory. Responses are generated via greedy decoding to ensure reproducibility across model comparisons.

PhoGPT-4B-Chat achieves an overall accuracy of 41.7% on the full ViTruthfulQA set. Notably, it attains the highest score among evaluated models when restricted to the subset of Vietnam-relevant questions, outperforming proprietary models such as GPT-3.5-turbo and Gemini Pro 1.0, as well as open-source baselines including Vistral-7B-Chat, Sailor-7B/-4B-Chat, and SeaLLM-7B-v2. This is particularly significant in the Vietnamese context, where PhoGPT-4B-Chat’s local language training and data curation allows for enhanced coverage and accuracy relative to multilingual or English-centric models (Nguyen et al., 2023).
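Accuracies like the 41.7% figure reduce to simple ratios over binary human labels: one computed over all questions, one restricted to the Vietnam-specific subset. A sketch with hypothetical records (not the actual ViTruthfulQA annotations):

```python
# Compute overall and subset accuracy from binary human judgments.
# The records below are hypothetical, for illustration only.

def accuracy(records, subset=None):
    pool = [r for r in records if subset is None or r["vn_specific"] == subset]
    if not pool:
        return 0.0
    return sum(r["correct"] for r in pool) / len(pool)

records = [
    {"correct": True,  "vn_specific": True},
    {"correct": False, "vn_specific": True},
    {"correct": True,  "vn_specific": False},
    {"correct": False, "vn_specific": False},
    {"correct": True,  "vn_specific": True},
]
print(f"overall: {accuracy(records):.1%}")            # 60.0%
print(f"VN-specific: {accuracy(records, True):.1%}")  # 66.7%
```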

4. Technical Innovations

Several technical features distinguish PhoGPT-4B-Chat:

  • Flash Attention (Triton): Accelerates attention computation for long sequences without loss of accuracy.
  • ALiBi (Attention with Linear Biases): Enables context-length extrapolation so the model can generalize to sequence lengths beyond those seen during training.
  • Vietnamese-specific tokenizer: A byte-level BPE approach reduces vocabulary inefficiencies and ensures fine-grained tokenization for Vietnamese script.
  • Compatibility with industry-standard deployment frameworks such as transformers, vLLM, and llama.cpp.
  • Training at scale enabled and managed by tools like MosaicML’s LLM-foundry.

These elements allow PhoGPT-4B-Chat to deliver state-of-the-art long-form dialogue, maintain context over extended turns, and process nuanced Vietnamese input effectively (Nguyen et al., 2023).
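ALiBi replaces positional embeddings with a fixed, head-specific linear penalty m · -(i - j) added to each attention score, which is what permits extrapolation past the 8,192-token training window. The per-head slopes follow a geometric schedule from the ALiBi paper (Press et al.); a sketch, shown for 24 heads to match the head count stated above:

```python
import math

# Standard ALiBi slope schedule: a geometric sequence for head counts that
# are powers of two, with an interleaving rule for other counts (such as
# the 24 heads used here). Sketch of the published recipe, not PhoGPT code.

def alibi_slopes(n_heads):
    def power_of_2(n):
        start = 2.0 ** (-(2.0 ** -(math.log2(n) - 3)))
        return [start ** (i + 1) for i in range(n)]
    if math.log2(n_heads).is_integer():
        return power_of_2(n_heads)
    closest = 2 ** math.floor(math.log2(n_heads))
    # borrow every other slope from the next power of two for remaining heads
    return power_of_2(closest) + alibi_slopes(2 * closest)[0::2][: n_heads - closest]

slopes = alibi_slopes(24)
print(len(slopes))  # 24
print(slopes[0])    # steepest penalty, applied by the first head
```

Because the penalty grows linearly with token distance rather than being learned per position, a model trained at 8,192 tokens degrades gracefully, instead of failing outright, on longer inputs.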

5. Applications and Real-World Deployment

PhoGPT-4B-Chat is engineered to address a spectrum of Vietnamese NLP tasks, including but not limited to:

  • Conversational agents for customer support and virtual assistants
  • Educational question answering and summarization
  • Content creation, instructional guidance, and information retrieval
  • Specialized domains where precise Vietnamese knowledge is needed, such as legal or medical consultation

Demonstrated use cases include educational platforms, legal advisory tools, and medical information provision, all leveraging the model’s robust native-language understanding. Support for deployment through major open-source frameworks facilitates integration into research, commercial, and governmental systems.

6. Limitations and Future Directions

The model inherits certain limitations typical of current generative LLMs:

  • Remaining gaps in abstract reasoning, multi-step mathematical queries, and complex coding tasks
  • Factuality: While leading among Vietnamese open-source models, PhoGPT-4B-Chat does not eliminate hallucinations or unsafe outputs
  • Safety and bias: fine-tuning is reported to improve guardrails, but ongoing human supervision is recommended to ensure ethical use

Planned directions for the PhoGPT series include expanded fine-tuning datasets, additional domain adaptation, and exploration of multi-modal and retrieval-augmented architectures. Ongoing benchmarking against new LLMs will guide updates to meet evolving application requirements (Nguyen et al., 2023).

7. Broader Context and Significance

PhoGPT-4B-Chat constitutes a significant contribution within a lineage of LLMs transitioning from generic English-centric LLMs toward regionally- and linguistically-adapted systems. It is the result of a dedicated, from-scratch pre-training effort tailored to Vietnamese linguistic features, corpus diversity, and real-world application demands. The open release lowers resource barriers for Vietnamese NLP research and encourages further community-driven model improvements and benchmarking.

| Feature | PhoGPT-4B-Chat | Notable Comparison |
|---|---|---|
| Parameters | 3.7B ("4B" naming) | Vistral-7B, Sailor-7B, GPT-3.5 |
| Training corpus size | 102B Vietnamese tokens | Varies; often smaller or cross-lingual |
| Context window | 8,192 tokens (with flash attention/ALiBi) | Usually 2k–4k (non-ALiBi) |
| ViTruthfulQA accuracy | 41.7% (all); state-best on VN-specific subset | Often lower, especially on local knowledge |
| Tokenizer | Vietnamese byte-level BPE | Multilingual or generic BPEs |

Conclusion

PhoGPT-4B-Chat sets a new benchmark for Vietnamese conversational AI through careful architectural choices, large-scale monolingual pre-training, and targeted fine-tuning. Its design and open availability address the needs of Vietnamese NLP practitioners across academia, industry, and government, while simultaneously informing the broader field on the importance of language-specific LLMs (Nguyen et al., 2023).

References

Nguyen et al. (2023). PhoGPT: Generative Pre-training for Vietnamese.