WangchanBERTa: Pretraining transformer-based Thai Language Models (2101.09635v2)

Published 24 Jan 2021 in cs.CL

Abstract: Transformer-based language models, more specifically BERT-based architectures, have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choices of models are limited to training a BERT-based model on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Moreover, large-scale multi-lingual pretraining does not take into account language-specific features for Thai. To overcome these limitations, we pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size), curated from diverse domains of social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly the preservation of spaces, which are important chunk and sentence boundaries in Thai, before subword tokenization. We also experiment with word-level, syllable-level and SentencePiece tokenization with a smaller dataset to explore the effects of tokenization on downstream performance. Our model wangchanberta-base-att-spm-uncased trained on the 78.5GB dataset outperforms strong baselines (NBSVM, CRF and ULMFit) and multi-lingual models (XLMR and mBERT) on both sequence classification and token classification tasks in human-annotated, mono-lingual contexts.

An Overview of WangchanBERTa: Pretraining Transformer-based Thai Language Models

The paper "WangchanBERTa: Pretraining Transformer-based Thai LLMs" presents a targeted approach to address the complexities associated with LLMing for Thai, a comparatively low-resource language. The research focuses on overcoming the limitations introduced by small dataset sizes and the inadequacy of fine-tuning multi-lingual models by pretraining a LLM specifically optimized for Thai. The model architecture is based on RoBERTa, an extension of BERT, that is acknowledged for its robust pretraining framework.

Methodological Advancements

The principal contribution of this research is the development of a Thai language model, wangchanberta-base-att-spm-uncased. The model is pretrained on a substantial 78.5GB corpus that encompasses texts from diverse sources such as social media, news articles, and other publicly accessible datasets. An essential step in this process was deduplicating and cleaning these datasets to ensure high-quality input data. The paper emphasizes preserving spaces before subword tokenization, since spaces mark chunk and sentence boundaries in Thai and are therefore crucial for effective preprocessing; a minimal sketch of this step follows below.
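
The space-preservation step can be made concrete with a short preprocessing sketch. The placeholder token `<_>`, the whitespace normalization, and the line-level deduplication below are illustrative assumptions, not the authors' released pipeline.

```python
# Illustrative Thai-specific preprocessing: collapse whitespace runs, drop exact
# duplicates, and replace spaces with a placeholder token so the subword
# tokenizer cannot silently merge them away.
import re

SPACE_TOKEN = "<_>"  # assumed placeholder; the released pipeline may differ

def clean_thai_text(text: str) -> str:
    """Normalize whitespace and mark spaces explicitly before subword tokenization."""
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace runs
    return text.replace(" ", f" {SPACE_TOKEN} ")  # keep spaces as explicit tokens

def deduplicate(lines):
    """Drop exact duplicate lines, keeping first occurrences in order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

if __name__ == "__main__":
    sample = "ข่าวเช้าวันนี้   อากาศดีมาก"
    print(clean_thai_text(sample))  # -> "ข่าวเช้าวันนี้ <_> อากาศดีมาก"
```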

The research introduces various tokenization strategies, experimenting with word-level, syllable-level, and SentencePiece tokenization. The results of these experiments provide insights into the effects of different tokenization methods on the model's performance in downstream tasks, reflecting a nuanced understanding of language-specific intricacies.
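
The three granularities can be compared with off-the-shelf tools. The sketch below assumes PyThaiNLP for word- and syllable-level segmentation and trains a toy SentencePiece model; the toy corpus, vocabulary size, and library versions are assumptions, not the paper's actual configuration.

```python
# Comparing word-level, syllable-level, and SentencePiece tokenization on toy data.
# Assumes `pythainlp` and `sentencepiece` are installed; syllable_tokenize is
# available in PyThaiNLP 2.x/3.x and may be exposed differently in later releases.
import sentencepiece as spm
from pythainlp.tokenize import syllable_tokenize, word_tokenize

text = "วันนี้อากาศดีมาก"

print(word_tokenize(text))      # dictionary-based word segmentation
print(syllable_tokenize(text))  # dictionary-based syllable segmentation

# Train a tiny unigram SentencePiece model on an illustrative corpus.
corpus = [
    "วันนี้อากาศดีมาก",
    "ผมไปเที่ยวทะเลกับครอบครัว",
    "ข่าวเศรษฐกิจวันนี้น่าสนใจ",
    "ร้านอาหารแถวนี้อร่อยทุกร้าน",
]
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus))

spm.SentencePieceTrainer.train(
    input="toy_corpus.txt",
    model_prefix="toy_spm",
    vocab_size=100,           # toy value; the paper's vocabulary is far larger
    character_coverage=1.0,
    hard_vocab_limit=False,   # tolerate the tiny corpus
)
sp = spm.SentencePieceProcessor(model_file="toy_spm.model")
print(sp.encode(text, out_type=str))  # subword pieces
```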

Numerical Results and Performance Metrics

The wangchanberta model showcases superior performance over several established baselines, including NBSVM, CRF, and ULMFit, as well as multi-lingual models such as XLMR and mBERT. Specifically, the model demonstrates enhanced results in both sequence classification and token classification tasks, setting a new benchmark for Thai language processing in these contexts. This performance underscores the efficacy of a tailored, language-specific pretrained model architecture over broader, multi-lingual models or models trained on inadequately sized datasets.
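
In practice, the pretrained checkpoint is consumed like any other masked-LM encoder for such downstream tasks. The skeleton below assumes the model is published on the Hugging Face Hub as `airesearch/wangchanberta-base-att-spm-uncased` and uses toy sentences and placeholder labels; it is a fine-tuning sketch, not the paper's evaluation setup.

```python
# Minimal sequence-classification skeleton with Hugging Face transformers.
# The Hub ID, toy sentences, and two-label setup are assumptions for illustration;
# loading the tokenizer also requires the `sentencepiece` package.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "airesearch/wangchanberta-base-att-spm-uncased"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

batch = tokenizer(
    ["ร้านนี้อร่อยมาก", "บริการแย่ ไม่ประทับใจเลย"],  # toy positive/negative reviews
    padding=True,
    truncation=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)
print(outputs.loss.item())            # cross-entropy loss to backpropagate
print(outputs.logits.argmax(dim=-1))  # predicted class per sentence
```

From here, a standard training loop (or the transformers Trainer) can be applied to the labeled downstream datasets; token classification follows the same pattern with a token-level head.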

Implications and Future Directions

From a theoretical perspective, the results underline the tangible benefits of large-scale pretraining of custom language models for low-resource languages. Practically, they suggest that careful language-specific preprocessing can markedly influence downstream performance, a key observation for model development in other low-resource languages.

The success of WangchanBERTa implies several avenues for future research. Extending this approach to other low-resource languages could replicate or even surpass the advances evidenced for Thai. Further exploration of optimal tokenization strategies tailored to individual languages may also yield incremental improvements in modeling accuracy and efficiency.

In sum, this research makes an important contribution to Thai language processing by demonstrating how transformer-based models like RoBERTa can be adapted and optimized for a specific linguistic context. It presents a viable pathway for advancing natural language processing for underrepresented languages, marking a shift towards more inclusive language models.

Authors (4)
  1. Lalita Lowphansirikul (4 papers)
  2. Charin Polpanumas (6 papers)
  3. Nawat Jantrakulchai (1 paper)
  4. Sarana Nutanong (14 papers)
Citations (66)