
CINO: A Chinese Minority Pre-trained Language Model (2202.13558v2)

Published 28 Feb 2022 in cs.CL

Abstract: Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks, which greatly facilitates natural language processing applications for low-resource languages. However, there are still languages on which current multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of multilingual models on ethnic minority languages, we collect documents from Wikipedia and news websites and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at http://cino.hfl-rc.com.
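Since the model is released publicly, a natural first step is loading it for fine-tuning on one of the classification datasets described above. The sketch below uses the Hugging Face transformers library; the checkpoint ID `hfl/cino-base-v2` and the label count are illustrative assumptions, not details from the abstract, so check the release page at http://cino.hfl-rc.com for the published checkpoints.

```python
# Minimal sketch: loading a CINO checkpoint for sequence classification.
# The checkpoint ID "hfl/cino-base-v2" and num_labels=10 are assumptions
# for illustration; see http://cino.hfl-rc.com for the released models.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("hfl/cino-base-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/cino-base-v2",
    num_labels=10,  # set to the number of classes in your dataset (e.g. WCM topics)
)
# Note: the classification head is freshly initialized here, so the model
# must be fine-tuned on a labeled dataset such as WCM or CMNews before
# its predictions carry any meaning.

text = "这是一条新闻文本。"  # "This is a news text." (Standard Chinese)
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # index of the highest-scoring class
```

The same loading pattern applies to the other languages the model covers, since a single multilingual checkpoint and tokenizer handle all of them.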

Authors (7)
  1. Ziqing Yang
  2. Zihang Xu
  3. Yiming Cui
  4. Baoxin Wang
  5. Min Lin
  6. Dayong Wu
  7. Zhigang Chen
Citations (23)
