CPM: A Large-scale Generative Chinese Pre-trained Language Model (2012.00413v1)

Published 1 Dec 2020 in cs.CL

Abstract: Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB training data, drew a lot of attention due to its capacity for few-shot (even zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in the settings of few-shot (even zero-shot) learning. The code and parameters are available at https://github.com/TsinghuaAI/CPM-Generate.

Overview of CPM: A Large-scale Generative Chinese Pre-trained Language Model

This paper introduces the Chinese Pre-trained Language Model (CPM), a significant contribution to NLP focused on Chinese text corpora. CPM is a Transformer-based autoregressive language model with 2.6 billion parameters, trained on 100GB of Chinese data. At the time of publication it was the largest Chinese pre-trained language model, and it is intended to facilitate numerous downstream NLP tasks such as conversation, essay generation, and language understanding.

Motivation and Objectives

The motivation behind CPM stems from the difficulty of applying existing large-scale models such as GPT-3, which are trained primarily on English corpora, to Chinese NLP tasks. GPT-3's performance in few-shot and zero-shot learning has highlighted the potential of such models, yet its applicability to Chinese tasks is limited because its parameters are not publicly available and its training data contains little Chinese text.

Technical Approach

CPM adopts a Transformer-based autoregressive architecture, following GPT-style generative pre-training but tailored to Chinese language contexts. Because traditional character-level or BERT-derived vocabularies lose word-level semantics in Chinese, the authors construct a new sub-word vocabulary that contains both words and characters for better language representation. In addition, the model is trained with a large batch size of 3,072, which the authors adopt to cope with the sparse word-level distribution and to keep training stable.
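
The paper does not spell out the tooling behind this word-and-character vocabulary, but a rough approximation can be sketched as follows. The word segmenter (jieba), sub-word trainer (sentencepiece), file names, and vocabulary size are all assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical sketch: build a sub-word vocabulary that keeps both
# frequent Chinese words and individual characters.
# Assumptions (not from the paper): jieba for word segmentation,
# sentencepiece for sub-word training, corpus.txt as the raw input.
import jieba
import sentencepiece as spm

# 1. Pre-segment the raw corpus into words so that frequent words
#    can survive as single vocabulary entries.
with open("corpus.txt", encoding="utf-8") as fin, \
     open("corpus.seg.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        words = jieba.lcut(line.strip())
        fout.write(" ".join(words) + "\n")

# 2. Train a sub-word model on the segmented text. Covered characters
#    are always kept, so rare words fall back to character pieces.
spm.SentencePieceTrainer.train(
    input="corpus.seg.txt",
    model_prefix="cpm_vocab",
    vocab_size=30000,           # illustrative; CPM's real size may differ
    character_coverage=0.9995,  # keep (almost) all Chinese characters
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="cpm_vocab.model")
print(sp.encode("清华大学发布了大规模中文预训练语言模型", out_type=str))
```

Because the sub-word model always retains covered characters as fallback pieces, rare or unsegmented words decompose into characters, which mirrors the paper's goal of representing both words and characters in one vocabulary.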

Experimental Results

The performance of CPM has been benchmarked across various tasks:

  1. Text Classification: Across datasets such as TNEWS, IFLYTEK, and OCNLI, CPM demonstrated promising accuracy, especially in zero-shot settings, outperforming smaller models and underscoring the advantage of larger parameter counts (a minimal zero-shot scoring sketch follows this list).
  2. Chinese Idiom Cloze (ChID): The model's ability was benchmarked in both supervised and unsupervised settings. CPM-Large surpassed its smaller counterparts even in unsupervised scenarios, emphasizing the model's learned language proficiency.
  3. Dialogue Generation: Using the Short-Text Conversation (STC) dataset, CPM was shown to achieve a higher diversity in generated responses compared to other state-of-the-art models, particularly when evaluated in few-shot settings.
  4. Question Answering: Performance on the CMRC2018 and DuReader benchmarks highlighted the model's limitations in generating precise answers without fine-tuning, though results improved with one-shot learning strategies.
  5. Entity Generation: The model’s ability to generate accurate tail entities in the XLORE dataset was notable, especially in few-shot conditions, revealing its capacity for factual knowledge probing.
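
As a concrete illustration of the zero-shot setting, the sketch below scores each candidate label by the model's log-likelihood of a prompt that appends the label to the passage, then picks the best-scoring label. The Hugging Face model id, prompt template, and label set here are assumptions for illustration; the paper's actual templates and evaluation protocol differ in detail.

```python
# Hedged sketch of zero-shot classification by likelihood scoring,
# in the spirit of CPM's zero-shot text-classification evaluation.
# Assumptions: the released weights are loadable via Hugging Face
# transformers under the id "TsinghuaAI/CPM-Generate"; the prompt
# template and labels are illustrative, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TsinghuaAI/CPM-Generate"   # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def label_score(passage: str, label: str) -> float:
    """Average log-likelihood of the passage followed by a label prompt."""
    text = f"{passage}这篇文章的主题是{label}。"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()   # higher is better

passage = "昨晚的比赛中，主队凭借最后一分钟的进球取得了胜利。"
labels = ["体育", "财经", "娱乐"]
pred = max(labels, key=lambda lab: label_score(passage, lab))
print(pred)  # expected: 体育 (sports)
```

The same scoring idea extends to few-shot evaluation by prepending a handful of labeled examples to the prompt before the passage to be classified.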

Implications and Future Directions

CPM marks a significant step toward better NLP tools for Chinese. Practical applications range from enhancing automated essay scoring systems to improving information retrieval in Chinese contexts. From a theoretical standpoint, CPM offers insights into model scaling and its impact on language model capabilities.

Looking forward, the authors suggest optimizations in pre-training frameworks to handle computational costs more effectively, including distributed training strategies and model compression techniques. Expansion plans include incorporating multi-lingual corpora to develop a comprehensive multi-lingual model, and integrating structured data like knowledge graphs for improved contextual language understanding.

In essence, CPM stands as a pivotal advancement in NLP for the Chinese language, setting a benchmark for future research endeavors aimed at non-English language applications.

Authors (25)
  1. Zhengyan Zhang (46 papers)
  2. Xu Han (270 papers)
  3. Hao Zhou (351 papers)
  4. Pei Ke (37 papers)
  5. Yuxian Gu (21 papers)
  6. Deming Ye (10 papers)
  7. Yujia Qin (41 papers)
  8. Yusheng Su (21 papers)
  9. Haozhe Ji (11 papers)
  10. Jian Guan (65 papers)
  11. Fanchao Qi (33 papers)
  12. Xiaozhi Wang (51 papers)
  13. Yanan Zheng (13 papers)
  14. Guoyang Zeng (14 papers)
  15. Huanqi Cao (6 papers)
  16. Shengqi Chen (8 papers)
  17. Daixuan Li (5 papers)
  18. Zhenbo Sun (4 papers)
  19. Zhiyuan Liu (433 papers)
  20. Minlie Huang (225 papers)
Citations (110)