Overview of Xmodel-LM: A Technical Report
The paper "Xmodel-LM Technical Report" by Wang Yichuan, Liu Yang, Yan Yu, Huang Xucheng, and Jiang Ling from XiaoduoAI introduces Xmodel-LM, a compact 1.1-billion-parameter LLM. Xmodel-LM is distinguished by its efficient architecture and its training on a balanced Chinese and English dataset, Xdata, comprising over 2 trillion tokens. Despite its relatively modest size, Xmodel-LM delivers notable performance across multiple benchmark tasks, surpassing several contemporary open-source LLMs of comparable scale.
Pretraining Process
The pretraining of Xmodel-LM is meticulously detailed in the report. The dataset, Xdata, draws on sources used to train other open-source LLMs, including RedPajama, subsets of the Pile, and StarCoder, alongside further sources chosen to ensure a balanced representation of Chinese and English corpora. It also incorporates FanFics, OpenWebMath, PTD, and WanJuan to bolster specific content areas. The data preprocessing pipeline involves initial heuristic filtering, quality filtering with a 5-gram Kneser-Ney language model, and deduplication via SimHash-based locality-sensitive hashing, as sketched below.
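The report does not publish pipeline code, so the following is a minimal Python sketch of such a three-stage pipeline. The KenLM model path, the thresholds, and the linear duplicate scan are illustrative assumptions, not details from the paper; only the overall structure (heuristic filter, 5-gram Kneser-Ney perplexity filter, SimHash near-deduplication) follows the report.

```python
# Sketch of a heuristic -> perplexity -> SimHash cleaning pipeline.
# Thresholds and file names are assumptions, not values from the report.
import hashlib
import re

import kenlm  # pip install kenlm; expects a pretrained 5-gram Kneser-Ney model

PPL_THRESHOLD = 1_000      # assumed perplexity cutoff
HAMMING_THRESHOLD = 3      # assumed near-duplicate radius for 64-bit SimHash

lm = kenlm.Model("kn_5gram.binary")  # hypothetical model file


def heuristic_ok(doc: str) -> bool:
    """Cheap rule-based checks: minimum length and alphabetic ratio."""
    if len(doc) < 200:
        return False
    alpha = sum(ch.isalpha() for ch in doc)
    return alpha / max(len(doc), 1) > 0.6


def quality_ok(doc: str) -> bool:
    """Keep documents the 5-gram model finds sufficiently fluent."""
    return lm.perplexity(doc) < PPL_THRESHOLD


def simhash(doc: str, bits: int = 64) -> int:
    """64-bit SimHash over word 3-grams."""
    tokens = re.findall(r"\w+", doc.lower())
    grams = [" ".join(tokens[i:i + 3]) for i in range(max(len(tokens) - 2, 1))]
    weights = [0] * bits
    for g in grams:
        h = int.from_bytes(hashlib.md5(g.encode()).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def is_near_duplicate(sig: int, seen: list[int]) -> bool:
    # A production pipeline would bucket signatures with LSH; a linear scan
    # is used here purely to keep the sketch short.
    return any(bin(sig ^ s).count("1") <= HAMMING_THRESHOLD for s in seen)


def clean(corpus):
    seen: list[int] = []
    for doc in corpus:
        if not (heuristic_ok(doc) and quality_ok(doc)):
            continue
        sig = simhash(doc)
        if is_near_duplicate(sig, seen):
            continue
        seen.append(sig)
        yield doc
```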
The tokenizer for Xmodel-LM is built with the unigram algorithm, trained on a mixed corpus of Chinese and English, yielding a relatively compact vocabulary of 32,000 tokens. Despite the smaller vocabulary, this tokenizer achieves higher compression rates on test data than tokenizers of models with larger vocabularies.
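For illustration, a unigram tokenizer of this kind can be trained with SentencePiece. Only the unigram model type and the 32,000-token vocabulary come from the report; the corpus file, character coverage, and byte-fallback settings below are assumptions.

```python
# Minimal sketch of training a unigram tokenizer on a mixed Chinese/English corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="mixed_zh_en_corpus.txt",  # hypothetical corpus file
    model_prefix="xmodel_unigram",
    model_type="unigram",
    vocab_size=32000,
    character_coverage=0.9995,       # assumed; helpful for CJK-heavy corpora
    byte_fallback=True,              # assumed; avoids out-of-vocabulary failures
)

sp = spm.SentencePieceProcessor(model_file="xmodel_unigram.model")
sample = "Xmodel-LM 在中英文语料上训练。It is a compact 1.1B model."
pieces = sp.encode(sample, out_type=str)
# Compression can be compared across tokenizers as characters per token.
print(len(sample) / len(pieces), pieces[:10])
```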
In terms of model architecture, Xmodel-LM closely follows LLaMA 2, featuring a hidden size of 2048, an intermediate size of 5632, 32 attention heads, 4 KV heads, 24 layers, and a context length of 4096. The network integrates rotary positional embeddings, RMSNorm normalization, SwiGLU activation, and grouped-query attention to enhance training stability and performance. Training runs on 8×H800 GPUs, leveraging Distributed Data Parallel (DDP) and FlashAttention-2, with gradient accumulation and the AdamW optimizer under a cosine learning-rate schedule.
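Because the architecture mirrors LLaMA 2, the reported hyperparameters can be expressed as a Hugging Face LlamaConfig for illustration. This is a sketch rather than the authors' implementation; the vocabulary size is taken from the tokenizer section, and any field not stated in the report is left at its library default.

```python
# Reported Xmodel-LM dimensions expressed as a LLaMA-style config (sketch only).
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=24,
    num_attention_heads=32,
    num_key_value_heads=4,        # grouped-query attention: 4 KV heads
    max_position_embeddings=4096,
    hidden_act="silu",            # SwiGLU gating in the MLP
    rms_norm_eps=1e-5,            # assumed value, not stated in the report
)

model = LlamaForCausalLM(config)  # randomly initialized, for size inspection only
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```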
Evaluation Results
Commonsense Reasoning
Xmodel-LM's evaluation covered multiple commonsense reasoning tasks run through the LM Evaluation Harness, including ARC-Challenge, ARC-Easy, BoolQ, HellaSwag, OpenBookQA, PIQA, SciQ, TriviaQA, and WinoGrande. Xmodel-LM surpassed several baseline models, notably outperforming TinyLlama, and performed comparably to Qwen1.5 on many metrics.
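For reference, a run of this commonsense suite through the harness's Python API might look like the sketch below. The Hugging Face model identifier, shot count, and batch size are placeholders, and exact task names can differ across harness versions and from the authors' setup.

```python
# Hedged sketch of evaluating the commonsense suite with lm-eval (>= 0.4).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=XiaoduoAILab/Xmodel-LM",  # placeholder repo id
    tasks=[
        "arc_challenge", "arc_easy", "boolq", "hellaswag",
        "openbookqa", "piqa", "sciq", "triviaqa", "winogrande",
    ],
    num_fewshot=0,   # assumed shot count
    batch_size=8,    # assumed
)

for task, metrics in results["results"].items():
    print(task, metrics)
```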
Problem-Solving Tasks
To probe Xmodel-LM's capabilities beyond commonsense reasoning, evaluations were also conducted on problem-solving benchmarks, including BBH, GLUE, GSM8K, and MMLU. Notably, Xmodel-LM outperformed the baseline models on BBH and remained strongly competitive overall.
Chinese Language Proficiency
Given the bilingual nature of its training data, Xmodel-LM’s Chinese language proficiency was also assessed using datasets like ARC-zh, XCOPA-zh, and XNLI-zh. While Xmodel-LM demonstrated commendable Chinese language understanding and generation capabilities, it was observed to be slightly behind some of the larger models such as InternLM2 and Qwen1.5.
Observations and Implications
The report includes a case analysis section discussing the evolution of Xmodel-LM's performance during training. The model exhibits an approximately linear relationship between the logarithm of the iteration step and metric gains across most tasks, suggesting stable improvement as training progresses. Additionally, an analysis of the L2-norm of the model parameters revealed a phase-like division similar to phenomena observed in prior research: the transition from a memorization phase to a generalization phase suggests the model progresses from overfitting toward learning generalized patterns.
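This kind of analysis can be reproduced on one's own checkpoints by computing the global L2-norm of all parameters at each saved step and plotting it against the logarithm of the step count. The sketch below assumes a flat PyTorch state_dict per checkpoint and a hypothetical checkpoints/step_*.pt layout; neither reflects the authors' actual checkpoint format.

```python
# Compute the global parameter L2-norm across saved training checkpoints.
import glob
import math

import torch

norms = []
for path in glob.glob("checkpoints/step_*.pt"):  # hypothetical checkpoint layout
    state_dict = torch.load(path, map_location="cpu")
    # Sum of squared entries over every parameter tensor in the checkpoint.
    sq_sum = sum(
        p.float().pow(2).sum().item()
        for p in state_dict.values()
        if torch.is_tensor(p)
    )
    step = int(path.split("step_")[-1].split(".")[0])
    if step == 0:
        continue  # skip step 0 so log10(step) is defined
    norms.append((step, math.sqrt(sq_sum)))

for step, norm in sorted(norms):
    # Plotting the norm (and benchmark scores) against log10(step) makes the
    # roughly linear trend and any phase transition easier to see.
    print(f"step={step:>8d}  log10(step)={math.log10(step):.2f}  l2_norm={norm:.1f}")
```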
Conclusions and Future Directions
The findings from the Xmodel-LM project underscore the feasibility and effectiveness of training smaller models on extensive, balanced datasets to achieve competitive performance. Xmodel-LM's efficient architecture and robust preprocessing pipeline offer insights into developing compact yet powerful LLMs. Future research could explore scaling the architecture while maintaining efficiency, expanding the diversity and size of the training data, and further optimizing training strategies to exploit multilingual capabilities. The public release of Xmodel-LM's code and model checkpoints on GitHub fosters reproducibility and further research in the domain.
Overall, Xmodel-LM represents a significant step towards achieving scalable, efficient, and multilingual LLMs, providing valuable contributions to both theoretical foundations and practical applications in NLP.