
PhoGPT: Generative Pre-training for Vietnamese (2311.02945v3)

Published 6 Nov 2023 in cs.CL

Abstract: We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192 context length, employing a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is the modeling output obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. In addition, we also demonstrate its superior performance compared to previous open-source models. Our PhoGPT models are available at: https://github.com/VinAIResearch/PhoGPT

Authors (6)
  1. Dat Quoc Nguyen (55 papers)
  2. Linh The Nguyen (8 papers)
  3. Chi Tran (6 papers)
  4. Dung Ngoc Nguyen (2 papers)
  5. Dinh Phung (147 papers)
  6. Hung Bui (23 papers)
Citations (6)

Summary

PhoGPT: Generative Pre-training for Vietnamese

This paper presents PhoGPT, a state-of-the-art generative pre-trained transformer model tailored for the Vietnamese language. The authors introduce two main components: PhoGPT-4B, the base model, and PhoGPT-4B-Chat, a variant fine-tuned for conversational tasks. With 3.7 billion parameters, the models are trained on a comprehensive Vietnamese corpus and aim to provide strong language-modeling capabilities for Vietnamese NLP tasks.

Model Development

PhoGPT-4B is built on a Transformer decoder architecture that incorporates Triton-based flash attention and ALiBi position biases for effective context-length extrapolation. It uses a byte-level BPE tokenizer with a vocabulary of 20,480 token types and is pre-trained on a 102-billion-token Vietnamese corpus spanning diverse text types, from Wikipedia articles to legal and medical documents, giving the model broad domain coverage of the Vietnamese language.
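To make the position-bias mechanism concrete, here is a minimal sketch of how ALiBi attention biases are typically computed; the head count, slope recipe, and tensor shapes below are illustrative assumptions, not details reported in the paper.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Standard ALiBi recipe for a power-of-two head count:
    # slopes form the geometric sequence 2^(-8/n), 2^(-16/n), ...
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Bias added to the attention logits before softmax: each head
    # linearly penalizes keys that lie far behind the query, which is
    # what allows extrapolation beyond the training context length.
    slopes = alibi_slopes(n_heads)                    # (H,)
    key_pos = torch.arange(seq_len).view(1, 1, -1)    # (1, 1, T)
    query_pos = torch.arange(seq_len).view(1, -1, 1)  # (1, T, 1)
    rel = (key_pos - query_pos).clamp(max=0)          # non-positive offsets
    return slopes.view(-1, 1, 1) * rel                # (H, T, T)

# Example: biases for 8 heads over a 16-token window.
bias = alibi_bias(n_heads=8, seq_len=16)
```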

The fine-tuning process to create PhoGPT-4B-Chat involved a dataset of 70,000 instructional prompts with responses and 290,000 conversational exchanges. By drawing on datasets such as Bactrian-X and UltraChat, the model is optimized to handle a variety of conversational interactions in Vietnamese.
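As a rough illustration, the released chat model can be loaded and queried with the Hugging Face transformers library along the lines sketched below; the model identifier and the question/answer prompt wording are assumptions about the published checkpoints, so the official repository README should be treated as authoritative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model ID; see https://github.com/VinAIResearch/PhoGPT
# for the officially released checkpoints and the exact prompt template.
model_id = "vinai/PhoGPT-4B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

# Illustrative instruction-style prompt ("Question: ... / Answer:").
prompt = "### Câu hỏi: Viết một đoạn văn ngắn về Hà Nội.\n### Trả lời:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```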

Performance Evaluation

PhoGPT-4B-Chat was evaluated against both closed-source models, such as Gemini Pro and GPT-3.5-turbo, and open-source models, such as Vistral-7B-Chat, using the ViTruthfulQA dataset of 199 Vietnamese truthful questions. On this benchmark, the 4B-parameter model was competitive with larger 7B-parameter open-source models and performed particularly well on Vietnam-specific questions, reflecting its specialized coverage of the Vietnamese language.
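A benchmark like this is typically run by collecting each model's response to every question and then comparing the answers across systems; the snippet below is a minimal, hypothetical harness for that first step (the file format, field names, and generate_fn callable are assumptions, and the paper's actual judging protocol is not reproduced here).

```python
import json
from typing import Callable

def collect_responses(questions_path: str,
                      generate_fn: Callable[[str], str]) -> list[dict]:
    # Read one JSON object per line (e.g. {"question": "..."}) and record
    # the model's answer to each question for later side-by-side comparison.
    with open(questions_path, encoding="utf-8") as f:
        questions = [json.loads(line)["question"] for line in f]
    return [{"question": q, "answer": generate_fn(q)} for q in questions]
```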

Implications and Future Directions

The introduction of PhoGPT models is significant for Vietnamese NLP, offering an open-source alternative that aligns with the linguistic and cultural context of Vietnamese users. The models present potential for various applications, from natural language understanding to conversational AI. However, the authors acknowledge limitations in areas like reasoning and coding, which suggests avenues for future improvement and research.

In essence, PhoGPT opens pathways for further exploration in Vietnamese generative LLMs, encouraging advancements and broader participation in regional AI development. Future work can focus on expanding the model's capabilities and evaluating it across diverse real-world tasks to enhance applicability and reliability.
