Introduction
Mistral 7B is a ground-breaking LLM that set new benchmarks in natural language processing, outperforming the larger Llama 2 13B and approaching CodeLlama 7B on code tasks. Despite this leap in general capability, the model lacked an understanding of the Malaysian context. That gap motivated the fine-tuning of Mistral 7B into Malaysian Mistral, a specialized LLM whose contextual understanding is grounded in an extensive 32.6 GB dataset tailored to the Malaysian linguistic landscape.
Pre-Training Procedure
The pre-training of Malaysian Mistral rested on a multi-faceted data collection effort. Central to the corpus was the Malay Wikipedia dump, downloaded and processed alongside a targeted filtering of the English Wikipedia dataset for content pertinent to Malaysia. Reputable linguistic sources such as the Malay dictionary "Kamus Dewan Edisi Keempat" established a lexical foundation, while data from the Malaysian Hansard, legal documents, and public government records gave the model a grasp of formal and legal discourse. Large-scale scraping of online articles further diversified the dataset, ensuring that many facets of Malaysian life were represented. After deduplication and post-processing, the model was trained with the standard causal language modeling objective on powerful GPUs with carefully selected hyperparameters.
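As a rough illustration of this stage, the sketch below performs exact deduplication by content hash and then runs causal language modeling with Hugging Face Transformers. The corpus path, sequence length, and all hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of the pre-training pipeline: exact deduplication by
# content hash, then causal (next-token prediction) language modeling.
# File paths and hyperparameters are illustrative, not the paper's values.
import hashlib

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

raw = load_dataset("json", data_files="malaysian_corpus.jsonl", split="train")

# Simple in-memory exact dedup (assumes a single process): keep only the
# first occurrence of each text hash.
seen = set()

def is_new(example):
    digest = hashlib.sha256(example["text"].encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

deduped = raw.filter(is_new)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = deduped.map(tokenize, batched=True, remove_columns=deduped.column_names)

# mlm=False selects the causal language modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="malaysian-mistral-pretrain",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

In practice, web-scale corpora are usually deduplicated with fuzzy methods such as MinHash rather than exact hashing, but the exact-hash version keeps the sketch self-contained.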
Fine-tuning Procedure
Fine-tuning relied on instruction datasets crafted with ChatGPT3.5, ChatGPT4, and neural machine translation. Synthetic question-answer pairs were generated from open-source QA datasets, chat instructions, and coding queries, a targeted effort to strengthen Malaysian Mistral on multifaceted tasks. The model was fine-tuned with a 16384-token context length to improve performance on instruction-based tasks, using the chat template developed by Mistral.
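To make the chat-template step concrete, the sketch below formats a single question-answer pair with Mistral's template via Hugging Face's apply_chat_template. The QA pair itself is invented for illustration; the real pairs were generated synthetically as described above.

```python
# Sketch: formatting a synthetic question-answer pair with the Mistral
# chat template before supervised fine-tuning. The pair is invented for
# illustration; the paper's data came from ChatGPT3.5/4 and translation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Apakah ibu negara Malaysia?"},
    {"role": "assistant", "content": "Ibu negara Malaysia ialah Kuala Lumpur."},
]

# apply_chat_template wraps the turns in Mistral's [INST] ... [/INST] markup.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)

# For training, tokenize with the long context window described above.
tokens = tokenizer.apply_chat_template(
    messages, tokenize=True, max_length=16384, truncation=True
)
```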
Evaluation and Conclusion
Benchmarking Malaysian Mistral against strong models such as ChatGPT3.5 and Claude 2 showed its superiority on bespoke Tatabahasa (Malay grammar) tests. Given natural-language questions, the fine-tuned model consistently discerned and answered queries despite the complexity of Malay grammar and contextual nuance; an inference sketch follows below. The release of the models and their source code supports public- and private-sector advancement, positioning Malaysian Mistral as a vital AI asset and reflecting a commitment to bringing state-of-the-art tools to the Malaysian tech community and beyond. The ambition to develop an open-source multi-modal model underscores the ongoing effort to push the boundaries of AI, treating current successes as a stepping stone to future innovation.
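The released checkpoints can be queried as in the sketch below, here with a Tatabahasa-style question. The repository id is an assumption based on the authors' naming conventions, and the prompt is invented; consult the official release page for the exact checkpoint names.

```python
# Sketch: querying the released instruction-tuned model with a Malay
# grammar question. The repository id below is an assumption, not a
# confirmed checkpoint name; the question is invented for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/malaysian-mistral-7b-32k-instructions"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [
    {
        "role": "user",
        "content": "Adakah ayat ini betul dari segi tatabahasa: "
        "'Dia pergi ke sekolah semalam.'?",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```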