
CINO: A Chinese Minority Pre-trained Language Model (2202.13558v2)

Published 28 Feb 2022 in cs.CL

Abstract: Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks, which greatly facilitates natural language processing applications for low-resource languages. However, there are still languages on which current multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of multilingual models on ethnic minority languages, we collect documents from Wikipedia and news websites and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at http://cino.hfl-rc.com.
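Since the model is released publicly, a natural first step is loading it for fine-tuning on one of the classification datasets described above. The sketch below uses the Hugging Face transformers library; the checkpoint ID `hfl/cino-base-v2` and the label count are illustrative assumptions, not details from the abstract, so check the release page at http://cino.hfl-rc.com for the published checkpoints.

```python
# Minimal sketch: loading a CINO checkpoint for sequence classification.
# The checkpoint ID "hfl/cino-base-v2" and num_labels=10 are assumptions
# for illustration; see http://cino.hfl-rc.com for the released models.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("hfl/cino-base-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/cino-base-v2",
    num_labels=10,  # set to the number of classes in your dataset (e.g. WCM topics)
)
# Note: the classification head is freshly initialized here, so the model
# must be fine-tuned on a labeled dataset such as WCM or CMNews before
# its predictions carry any meaning.

text = "这是一条新闻文本。"  # "This is a news text." (Standard Chinese)
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # index of the highest-scoring class
```

The same loading pattern applies to the other languages the model covers, since a single multilingual checkpoint and tokenizer handle all of them.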

Authors (7)
  1. Ziqing Yang
  2. Zihang Xu
  3. Yiming Cui
  4. Baoxin Wang
  5. Min Lin
  6. Dayong Wu
  7. Zhigang Chen
Citations (23)
