Xmodel-1.5: An 1B-scale Multilingual LLM (2411.10083v3)

Published 15 Nov 2024 in cs.CL

Abstract: We introduce Xmodel-1.5, a 1-billion-parameter multilingual LLM pretrained on 2 trillion tokens, designed for balanced performance and scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English, outperforming Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai. To support low-resource language research, we release Xdata_Thai, a Thai-specific evaluation dataset featuring unique linguistic challenges such as gendered particles and idioms. While the model demonstrates strong performance, there is still room for improvement in handling culturally specific nuances. We hope this work contributes to advancements in multilingual AI research. Models and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodeLLM-1.5

An Expert Overview of Xmodel-1.5: A 1B-Scale Multilingual LLM

The paper presents the development and evaluation of Xmodel-1.5, a multilingual LLM consisting of 1 billion parameters. The model was trained on a vast corpus comprising 2 trillion tokens from diverse sources to enhance its performance across multiple languages, with particular effectiveness in languages such as Thai, Arabic, French, Chinese, and English. This document discusses the key components of the Xmodel-1.5 architecture, pretraining mechanisms, tokenization strategy, and its evaluation against other contemporary LLMs.

Pretraining and Tokenization

Xmodel-1.5's pretraining focuses on a multilingual environment, drawing on data from the Multilang Wiki and CulturaX datasets. These datasets provide comprehensive coverage across both high-resource and low-resource languages, with specific efforts to over-sample languages that typically lack data, including Mongolian, Burmese, and Tamil. The dataset is complemented by additional Chinese data and curated Thai content, ensuring a robust cross-linguistic training foundation.

The paper describes the use of a unigram tokenizer, trained via SentencePiece to create a vocabulary size of 65,280 tokens. This approach is chosen over byte pair encoding due to its greater flexibility in handling low-frequency tokens and enabling faster training while maintaining language nuance and token efficiency. Furthermore, specific considerations were made in handling whitespace, which is critical for compression in code data.
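
To make the tokenization setup concrete, below is a minimal sketch of how a unigram tokenizer of this size could be trained with SentencePiece. Only the model type (unigram) and vocabulary size (65,280) come from the paper; the corpus path, character coverage, and whitespace/byte-fallback options are illustrative assumptions rather than the authors' exact settings.

```python
# Sketch: training a unigram tokenizer with SentencePiece.
# Only model_type and vocab_size are taken from the paper; the corpus path
# and remaining options are assumptions for illustration.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",   # hypothetical training corpus
    model_prefix="xmodel_unigram",
    vocab_size=65280,                  # vocabulary size reported in the paper
    model_type="unigram",              # unigram LM tokenizer rather than BPE
    character_coverage=0.9995,         # assumed value for multilingual coverage
    remove_extra_whitespaces=False,    # preserve whitespace, e.g. for code data
    byte_fallback=True,                # assumed: fall back to bytes for rare characters
)

# Load the trained tokenizer and encode a Thai example sentence.
sp = spm.SentencePieceProcessor(model_file="xmodel_unigram.model")
print(sp.encode("สวัสดีครับ", out_type=str))
```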

Model Architecture and Training

The architecture builds on rotary positional embeddings and employs RMSNorm for input normalization, combined with SwiGLU for non-linearity. Grouped-query attention (GQA) is implemented to enhance training and inference efficiency. Training is conducted over 600,000 iterations on H800 GPUs, using AdamW as the optimizer with a meticulously outlined learning rate schedule.
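
The following PyTorch sketch illustrates two of the components named above, RMSNorm and a SwiGLU feed-forward block. All dimensions are assumptions chosen for demonstration rather than Xmodel-1.5's released configuration; rotary embeddings and grouped-query attention would sit alongside these pieces in the full decoder block.

```python
# Illustrative sketch of RMSNorm and SwiGLU; dimensions are assumptions,
# not the released Xmodel-1.5 configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (no mean subtraction)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back down."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


# Toy usage with assumed dimensions.
x = torch.randn(2, 16, 2048)          # (batch, sequence, hidden)
x = RMSNorm(2048)(x)
x = SwiGLU(2048, hidden_dim=5632)(x)  # hidden_dim is an assumption
print(x.shape)                        # torch.Size([2, 16, 2048])
```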

An intriguing aspect is the dynamic allocation of multilingual data during training, rising from an initial 5% to 10%. This strategic increase reflects the importance of language diversification in pretraining, promoting broader multilingual proficiency.
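
A hedged sketch of how such a schedule could be expressed is shown below. The paper reports only the initial (5%) and final (10%) multilingual proportions; the linear ramp, batch size, and step count used here are assumptions.

```python
# Sketch of a data-mixing schedule raising the multilingual share from 5% to 10%.
# The linear ramp, batch size, and total step count are assumptions; the paper
# reports only the start and end proportions.
def multilingual_ratio(step: int, total_steps: int = 600_000,
                       start: float = 0.05, end: float = 0.10) -> float:
    """Fraction of the batch drawn from multilingual data at a given step."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def batch_composition(step: int, batch_size: int = 1024) -> dict:
    """Split a batch between the multilingual pool and the main data pool."""
    multilingual = round(batch_size * multilingual_ratio(step))
    return {"multilingual": multilingual, "main": batch_size - multilingual}


for step in (0, 300_000, 600_000):
    print(step, batch_composition(step))
# 0       -> {'multilingual': 51,  'main': 973}   (5%)
# 300000  -> {'multilingual': 77,  'main': 947}   (~7.5%)
# 600000  -> {'multilingual': 102, 'main': 922}   (10%)
```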

Evaluation and Results

Evaluation of Xmodel-1.5 covers multiple commonsense reasoning tasks on standard benchmarks such as ARC, BoolQ, and HellaSwag. Comparisons with models such as OPT, Pythia, and InternLM2 illustrate Xmodel-1.5's competitive performance; in particular, it surpasses TinyLlama across several metrics. Nevertheless, models such as Qwen2.5-1.5B still outperform Xmodel-1.5 on some benchmarks, highlighting areas for potential improvement.
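
Benchmarks of this kind are typically scored zero-shot by comparing the model's log-likelihood for each answer choice. The sketch below illustrates that standard procedure; the Hugging Face checkpoint name and the example question are placeholders, and this is not the paper's own evaluation harness.

```python
# Zero-shot multiple-choice scoring as commonly used for ARC, PIQA, HellaSwag:
# append each choice to the question and pick the choice with the highest
# summed log-likelihood. Checkpoint name and example question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "XiaoduoAILab/Xmodel-1.5"   # assumed identifier, not confirmed by the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


def choice_loglikelihood(context: str, choice: str) -> float:
    """Sum log-probabilities of the choice tokens conditioned on the context.

    Assumes the context tokenization is a prefix of the full tokenization,
    which is a common approximation in this style of evaluation.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll[:, ctx_ids.shape[1] - 1:].sum().item()  # score only choice tokens


question = "Question: Why do people wear sunglasses?\nAnswer:"
choices = [" To protect their eyes from bright light.", " To hear better."]
scores = [choice_loglikelihood(question, c) for c in choices]
print(choices[scores.index(max(scores))])
```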

The model's multilingual capabilities are examined through translated tasks such as XCOPA and Belebele_tha_Thai, where it shows notable proficiency. Performance is gauged against PolyLM and improves steadily across checkpoints as training progresses.

Additionally, instruction fine-tuning shows Xmodel-1.5's capability on specialized tasks such as e-commerce retrieval-augmented generation (RAG), achieving a high satisfaction rate in practical applications. The model demonstrates not only foundational language understanding but also adaptability to specific domains through post-training fine-tuning.
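
For readers unfamiliar with the setup, the toy example below illustrates the general shape of such a RAG flow: retrieve relevant product snippets, then condition the fine-tuned model on them. The retriever, documents, prompt template, and checkpoint name are all hypothetical; the paper does not specify its RAG pipeline.

```python
# Toy illustration of a retrieval-augmented generation flow for e-commerce Q&A.
# Retriever, documents, prompt, and checkpoint name are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "XiaoduoAILab/Xmodel-1.5"   # assumed identifier for the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

documents = [
    "Product A: wireless earbuds, 24-hour battery, IPX5 water resistance.",
    "Product B: over-ear headphones, active noise cancellation, wired option.",
]


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Naive keyword-overlap retriever, used purely for illustration."""
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:k]


query = "Which product is better for running in the rain?"
context = "\n".join(retrieve(query, documents))
prompt = f"Context:\n{context}\n\nCustomer question: {query}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```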

Implications and Future Prospects

Xmodel-1.5 exemplifies the progress in crafting efficient, multilingual models that address the pressing demand for cross-cultural and cross-linguistic AI systems. Its publicly available code and model facilitate further research in NLP, aiding in the exploration of more sophisticated and inclusive AI technologies.

The research community can draw several implications from this work, notably the feasibility of achieving robust language understanding in lower-resource environments without resorting to exceedingly large parameter sizes. Future research might consider expanding the language scope further or optimizing the balance between model size and performance to adapt to an even broader range of applications and linguistic contexts.

In conclusion, Xmodel-1.5 represents a significant contribution to multilingual NLP, demonstrating the effectiveness of strategic data sourcing and architectural design in accommodating global linguistic diversity. Advances of this kind in multilingual LLMs will foster ongoing discourse and innovation in computational linguistics, contributing toward the ultimate aim of seamless cross-linguistic communication.

Authors (4)
  1. Wang Qun
  2. Liu Yang
  3. Lin Qingquan
  4. Jiang Ling