Typhoon: Thai Large Language Models (2312.13951v1)
Abstract: Typhoon is a series of large language models (LLMs) developed specifically for the Thai language. This technical report presents the challenges of and insights into developing Thai LLMs, covering data preparation, pretraining, instruction tuning, and evaluation. Because one of the central challenges for low-resource languages is the scarcity of pretraining data, we apply continual training to transfer existing world knowledge from a strong LLM. To evaluate the Thai knowledge each model acquires during pretraining, we develop ThaiExam, a benchmark based on examinations for high-school students and investment professionals in Thailand. In addition, we fine-tune Typhoon to follow Thai instructions and evaluate the instruction-tuned models on Thai instruction datasets as well as on translation, summarization, and question-answering tasks. Experimental results on a suite of Thai benchmarks show that Typhoon outperforms all open-source Thai LLMs and performs on par with GPT-3.5 in Thai, despite having only 7 billion parameters, while being 2.62 times more efficient than GPT-3.5 at tokenizing Thai text.
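The abstract names two concrete techniques worth illustrating: continual training from a strong base LLM, and a tokenizer that handles Thai text more efficiently. Below are two minimal sketches of how each might look in practice. Both are assumptions layered on public APIs, not the paper's actual recipe: the checkpoint IDs (`mistralai/Mistral-7B-v0.1` as a base, `scb10x/typhoon-7b` for the released model), the corpus file, and all hyperparameters are illustrative placeholders.

First, continual causal-LM training on a Thai corpus with the Hugging Face `Trainer`; a conservative learning rate is one common way to limit catastrophic forgetting of the base model's existing knowledge:

```python
# Minimal continual-training sketch (illustrative, not the paper's recipe).
# Assumptions: a Mistral-7B-class base checkpoint; "thai_corpus.txt" is a
# placeholder for a cleaned Thai pretraining corpus.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(base_id)

# Tokenize a plain-text Thai corpus for causal-LM training.
dataset = load_dataset("text", data_files={"train": "thai_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="typhoon-continual",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,  # low LR to limit catastrophic forgetting
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Second, a rough way to estimate relative tokenizer efficiency on Thai text, comparing the model's tokenizer against GPT-3.5's via `tiktoken`; fewer tokens for the same text means cheaper inference and a longer effective context:

```python
# Sketch: estimating relative tokenizer efficiency on Thai text.
# "scb10x/typhoon-7b" is an assumed Hub ID; substitute the released checkpoint.
import tiktoken
from transformers import AutoTokenizer

thai_text = "ภาษาไทยเป็นภาษาราชการของประเทศไทย"  # any representative Thai sample

gpt35_enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
typhoon_tok = AutoTokenizer.from_pretrained("scb10x/typhoon-7b")

n_gpt35 = len(gpt35_enc.encode(thai_text))
n_typhoon = len(typhoon_tok.encode(thai_text, add_special_tokens=False))

# The abstract reports a 2.62x ratio on Thai text; a single sample will only
# approximate this, so average over a larger corpus for a meaningful estimate.
print(f"GPT-3.5: {n_gpt35} tokens, Typhoon: {n_typhoon} tokens, "
      f"ratio: {n_gpt35 / n_typhoon:.2f}")
```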
Authors: Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol, Ruangsak Patomwong, Pathomporn Chokchainant, Kasima Tharnpipitchai