
Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding (2401.13565v3)

Published 24 Jan 2024 in cs.CL

Abstract: In this paper, we present significant advancements in the pretraining of Mistral 7B, a large-scale LLM, using a dataset of 32.6 GB, equivalent to 1.1 billion tokens. We explore the impact of extending the context length, releasing models with context lengths of 4096 and 32768 tokens, and further refine performance with a specialized instruction-tuned model with a 16384-token context length, which we call Malaysian Mistral. Our experiments demonstrate the efficacy of continued pretraining and the influence of extended context lengths on Mistral 7B's language understanding capabilities. Additionally, we release a model specifically instruction-tuned with a 16384-token context length, showcasing its potential for capturing nuanced language intricacies. Furthermore, our research contributes to the benchmarking of Malaysian Mistral against prominent LLMs, including ChatGPT3.5 and Claude 2. We present compelling results indicating Malaysian Mistral's superior performance on the Tatabahasa (Malay grammar) test set, particularly when fine-tuned with instructions. All models are released at https://huggingface.co/collections/mesolitica/malaysian-mistral-7b-6528f2ec825f4bba46c1700c

Introduction

The evolution of AI has heralded the emergence of Mistral 7B, a ground-breaking LLM that set new benchmarks in natural language processing. Despite Mistral 7B's leap in performance, surpassing larger models such as Llama 2 13B and nearing the proficiency of CodeLlama 7B, a gap remained in its understanding of the Malaysian context. This gap propelled efforts to further pretrain and fine-tune Mistral 7B, resulting in Malaysian Mistral, a specialized LLM whose contextual understanding is refined on an extensive 32.6 GB dataset tailored to the Malaysian linguistic landscape.

Pre-Training Procedure

The pre-training of Malaysian Mistral involved a multi-faceted approach to data collection. Central to the construction of its corpus was the downloading and processing of the Malay Wikipedia dump, alongside targeted filtering of the English Wikipedia dataset to capture content pertinent to Malaysia. Beyond establishing a linguistic foundation with reputable sources such as the Malay dictionary "Kamus Dewan Edisi Keempat", the inclusion of data from the Malaysia Hansard, legal documents, and government public records gave the model a grasp of formal and legal discourse. Scraping online articles further diversified the dataset, ensuring the representation of various facets of Malaysian life. After deduplication and postprocessing, the model was trained with a causal language modeling objective on powerful GPUs with carefully selected hyperparameters.
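
For concreteness, the following is a minimal sketch of what such continued causal-LM pretraining can look like with the Hugging Face Transformers library; the corpus path, context length, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: continued causal-LM pretraining of Mistral 7B on a Malay corpus.
# The corpus path and hyperparameters below are assumptions for illustration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical deduplicated Malay corpus stored as plain-text files.
dataset = load_dataset("text", data_files={"train": "malay_corpus/*.txt"})

def tokenize(batch):
    # Truncate to the 4096-token context used for the base release.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives standard next-token (causal) language modeling labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="malaysian-mistral-pretrain",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```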

Fine-tuning Procedure

ChatGPT3.5, ChatGPT4, and neural machine translation were used to craft the instruction datasets central to fine-tuning. Synthetic question-answer pairs generated from open-source QA datasets, chat instructions, and coding queries formed a targeted effort to elevate Malaysian Mistral's ability to handle multifaceted tasks. The model was fine-tuned with a 16384-token context length to enhance performance on instruction-based tasks, employing the chat template developed by Mistral.
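
As a rough illustration of that last step, the snippet below formats a single instruction example with Mistral's chat template via the tokenizer; the Malay example text is invented for illustration, and this is a sketch of the data-preparation idea rather than the authors' pipeline.

```python
# Minimal sketch: wrap one instruction/response pair in Mistral's
# [INST] ... [/INST] chat format before fine-tuning.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Invented example pair for illustration only.
messages = [
    {"role": "user",
     "content": "Terangkan maksud peribahasa 'bagai aur dengan tebing'."},
    {"role": "assistant",
     "content": "Peribahasa ini menggambarkan hubungan yang saling membantu."},
]

# Render the turns as a single training string in the Mistral chat format.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)

# Tokenize with the long context budget used for the instruction-tuned model.
ids = tokenizer(text, truncation=True, max_length=16384).input_ids
```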

Evaluation and Conclusion

Benchmarking Malaysian Mistral against formidable models such as ChatGPT3.5 and Claude 2 highlighted its superiority on the bespoke Tatabahasa test set. When processing natural language questions, the fine-tuned model exhibited a consistent ability to discern and answer queries despite the complexity of Malay grammar and contextual nuance. The release of the models and their source code facilitates public and private sector advancements, positioning Malaysian Mistral as a vital AI asset and underscoring the commitment to providing state-of-the-art tools to the Malaysian tech community and beyond. The stated ambition to develop an open-source multi-modal model emphasizes the ongoing effort to push the boundaries of AI, treating current successes as a stepping stone to future innovations.
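
For readers who want to try the released checkpoints, a minimal inference sketch follows; the exact repository name inside the linked Hugging Face collection is an assumption here, so substitute the checkpoint you intend to use.

```python
# Minimal sketch: querying a released checkpoint with a Tatabahasa-style
# question. The repository name is an assumed placeholder; adjust as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/malaysian-mistral-7b-32k-instructions"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": "Ayat manakah yang menggunakan imbuhan 'ber-' dengan betul?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```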

Authors (4)
  1. Husein Zolkepli (4 papers)
  2. Aisyah Razak (5 papers)
  3. Kamarul Adha (5 papers)
  4. Ariff Nazhan (5 papers)
Citations (3)