Overview of CPM: A Large-scale Generative Chinese Pre-trained Language Model
This paper introduces the Chinese Pre-trained Language Model (CPM), a significant contribution to NLP for Chinese text. CPM is a Transformer-based autoregressive language model with 2.6 billion parameters, trained on 100GB of Chinese data. The authors report it as the largest Chinese pre-trained language model at the time of publication, intended to facilitate downstream NLP tasks such as conversation, essay generation, cloze tests, and language understanding.
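As a rough sanity check on the 2.6-billion-parameter figure, the back-of-the-envelope estimate below assumes a CPM-Large-like configuration of 32 decoder layers, a hidden size of 2,560, and a sub-word vocabulary of roughly 30,000 entries (hedged values, not quoted from this summary), and it ignores biases, layer norms, and position embeddings.

```python
# Back-of-the-envelope parameter count for a GPT-style decoder.
# Assumed (not confirmed in this summary) CPM-Large-like configuration:
n_layers = 32        # Transformer decoder layers
d_model = 2560       # hidden size
vocab_size = 30_000  # approximate sub-word vocabulary size

# Per layer: attention projections (Q, K, V, output) ~ 4 * d^2,
# plus the feed-forward block with the usual 4x expansion ~ 8 * d^2.
per_layer = 12 * d_model ** 2
embedding = vocab_size * d_model  # token embeddings (assumed tied with the output layer)

total = n_layers * per_layer + embedding
print(f"~{total / 1e9:.2f}B parameters")  # ~2.59B, consistent with the reported 2.6B
```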
Motivation and Objectives
The motivation behind CPM stems from the difficulty of applying existing large-scale models such as GPT-3, which are trained primarily on English corpora, to Chinese NLP tasks. GPT-3's strong few-shot and zero-shot performance has highlighted the potential of such models, yet its parameters are not publicly released and its training data contains little Chinese, which limits its applicability to Chinese tasks.
Technical Approach
CPM adopts a Transformer-based autoregressive architecture in the spirit of GPT-3's generative design, but is tailored to Chinese. It constructs a new sub-word vocabulary containing both words and characters, since the character-level vocabularies used by models such as BERT-Chinese ignore the semantics carried by multi-character Chinese words. In addition, because word-level tokens follow a sparser distribution than characters, the model is trained with a large batch size of 3,072 instances to keep training stable.
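To make the mixed word/character vocabulary idea concrete, here is a minimal, illustrative sketch: frequent segmented words become single tokens, and individual characters serve as a fallback. The frequency threshold, the helper names, and the pre-segmented toy corpus are assumptions for illustration only; this does not reproduce CPM's actual vocabulary-construction pipeline.

```python
from collections import Counter

def build_vocab(segmented_corpus, min_word_freq=2):
    """Build a mixed word/character vocabulary.

    segmented_corpus: an iterable of sentences, each already split into words
    (CPM relies on Chinese word segmentation for this step; any segmenter
    could stand in here).
    """
    word_counts = Counter(w for sent in segmented_corpus for w in sent)
    vocab = {"<unk>": 0}
    # Keep frequent multi-character words as single tokens.
    for word, freq in word_counts.most_common():
        if freq >= min_word_freq and word not in vocab:
            vocab[word] = len(vocab)
    # Always include individual characters so any text can be encoded.
    for sent in segmented_corpus:
        for word in sent:
            for char in word:
                if char not in vocab:
                    vocab[char] = len(vocab)
    return vocab

def tokenize(segmented_sentence, vocab):
    """Map each word to its id, falling back to characters (then <unk>)."""
    ids = []
    for word in segmented_sentence:
        if word in vocab:
            ids.append(vocab[word])
        else:
            ids.extend(vocab.get(char, vocab["<unk>"]) for char in word)
    return ids

# Toy usage with a pre-segmented two-sentence corpus.
corpus = [["自然", "语言", "处理"], ["自然", "语言", "生成"]]
vocab = build_vocab(corpus, min_word_freq=2)
print(tokenize(["自然", "语言", "理解"], vocab))
```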
Experimental Results
The performance of CPM has been benchmarked across various tasks:
- Text Classification: On TNEWS, IFLYTEK, and OCNLI, CPM achieved promising zero-shot accuracy, outperforming its smaller counterparts and underscoring the benefit of larger parameter counts (a sketch of this likelihood-based zero-shot scoring follows the list).
- Chinese Idiom Cloze (ChID): The model was benchmarked in both supervised and unsupervised settings; CPM-Large surpassed its smaller counterparts even without supervision, indicating that substantial idiom knowledge is acquired during pre-training.
- Dialogue Generation: On the Short-Text Conversation (STC) dataset, CPM produced more diverse responses than other state-of-the-art Chinese dialogue models, particularly in few-shot settings (diversity is commonly measured with distinct-n; a minimal implementation also follows the list).
- Question Answering: Results on the CMRC2018 and DuReader benchmarks showed that, without tuning, the model tends to produce related but imprecise answers, though performance improved with a one-shot (single in-context example) setup.
- Entity Generation: Given a head entity and a relation from the XLORE knowledge graph, CPM generated tail entities with notable accuracy, especially in few-shot conditions, suggesting that the model captures factual knowledge during pre-training.
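Regarding the zero-shot classification results above: the evaluation relies on comparing the model's likelihood of candidate completions rather than training a classification head. The sketch below illustrates that idea with the Hugging Face transformers API; the checkpoint name, prompt template, and label set are placeholders, not the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: any Chinese autoregressive LM would do for this sketch.
MODEL_NAME = "your-chinese-causal-lm"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def label_loss(text: str, label: str) -> float:
    """Average token loss of the prompt with a candidate label appended."""
    prompt = f"新闻：{text} 类别：{label}"  # illustrative template, not CPM's exact one
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Using input_ids as labels gives the standard causal-LM cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item()

def zero_shot_classify(text: str, labels: list[str]) -> str:
    """Pick the label whose completion the model finds most probable."""
    return min(labels, key=lambda lab: label_loss(text, lab))

print(zero_shot_classify("国足在世界杯预选赛中获胜", ["体育", "财经", "科技"]))
```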
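On the dialogue diversity claim: a standard measure (and a plausible reading of "diversity" here, though the summary does not name the exact metric) is distinct-n, the ratio of unique n-grams to total n-grams across generated responses. A minimal implementation:

```python
def distinct_n(responses, n=2):
    """distinct-n: unique n-grams divided by total n-grams across responses.

    responses: list of tokenized responses (lists of tokens or characters).
    Higher values indicate more diverse generations.
    """
    total, unique = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Toy usage: character-level distinct-2 for two generated replies.
replies = [list("今天天气不错"), list("今天心情不错")]
print(f"distinct-2 = {distinct_n(replies, n=2):.2f}")
```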
Implications and Future Directions
CPM represents a significant step toward stronger NLP tools for Chinese. Practical applications range from enhancing automated essay scoring systems to improving information retrieval in Chinese contexts. From a theoretical standpoint, CPM offers evidence on how model scale affects the capabilities of pre-trained language models.
Looking forward, the authors suggest optimizations in pre-training frameworks to handle computational costs more effectively, including distributed training strategies and model compression techniques. Expansion plans include incorporating multi-lingual corpora to develop a comprehensive multi-lingual model, and integrating structured data like knowledge graphs for improved contextual language understanding.
In essence, CPM stands as a pivotal advancement in NLP for the Chinese language, setting a benchmark for future research endeavors aimed at non-English language applications.