
MaLLaM -- Malaysia Large Language Model (2401.14680v2)

Published 26 Jan 2024 in cs.CL

Abstract: Addressing the gap in LLMs pretrained from scratch with Malaysian context, we trained models with 1.1 billion, 3 billion, and 5 billion parameters on a substantial 349 GB dataset, equivalent to 90 billion tokens under our pretrained Byte Pair Encoding (BPE) tokenizer, for a single epoch. MaLLaM contributes to enhanced natural language understanding and generation tasks in the Malay language. Although trained on a comparatively small budget of 90 billion tokens, our instruction-tuned MaLLaM models perform competitively: against ChatGPT-3.5 and Malaysian Mistral, they demonstrate notable proficiency, underscoring the effectiveness of our approach in capturing and understanding the nuances of the Malaysian language. MaLLaM models mark a significant contribution to the field, providing comprehensive language representations grounded in Malaysian context, and aim to pave the way for enhanced natural language understanding and generation tasks specific to the linguistic nuances present in Malaysia. We discuss the training methodology, dataset composition, and the potential impact of MaLLaM in advancing the capabilities of LLMs for the Malay language. All models are released at https://huggingface.co/collections/mesolitica/mallam-6577b59d1e0b436ae75f930f
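
The 90-billion-token figure above is defined relative to the paper's own BPE tokenizer: the same 349 GB corpus would yield a different token count under a different vocabulary. Below is a minimal sketch of training such a tokenizer with the Hugging Face tokenizers library; the corpus file name, vocabulary size, and special tokens are illustrative assumptions, not the paper's actual settings.

```python
# Sketch: training a Byte Pair Encoding (BPE) tokenizer on a Malay corpus
# with the Hugging Face `tokenizers` library. The file name, vocab size,
# and special tokens are assumptions for illustration only.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32_000,  # assumed size, not confirmed by the paper
    special_tokens=["[UNK]", "<s>", "</s>"],
)
tokenizer.train(files=["malay_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("mallam-bpe.json")

# The corpus size in tokens is then simply the total encoded length,
# which is how a byte count like 349 GB maps to a token count like 90B.
with open("malay_corpus.txt", encoding="utf-8") as f:
    n_tokens = sum(len(tokenizer.encode(line).ids) for line in f)
print(f"corpus size: {n_tokens} tokens")
```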

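Since all checkpoints are published in the Hugging Face collection linked in the abstract, a hedged sketch of loading and prompting an instruction-tuned MaLLaM model with the transformers library follows. The repository id is an assumed placeholder; the exact names are listed in the collection.

```python
# Sketch: loading an instruction-tuned MaLLaM checkpoint with Hugging Face
# transformers. The repository id below is an assumption; check the
# collection linked above for the exact model names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/mallam-5b-4096"  # hypothetical id, verify in the collection

# The tokenizer shipped with each checkpoint is the paper's
# Malaysian-context BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit the 5B model on one GPU
    device_map="auto",
)

# "Briefly explain what a large language model is." (in Malay)
prompt = "Terangkan secara ringkas apa itu model bahasa besar."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
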
Authors (4)
  1. Husein Zolkepli
  2. Aisyah Razak
  3. Kamarul Adha
  4. Ariff Nazhan