C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models (2405.17732v2)
Abstract: Classical Chinese Understanding (CCU) holds significant value for preserving and exploring outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of LLMs for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$Bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs spanning five primary CCU tasks: classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C$^{3}$Bench originates from ten different domains, covering most categories of classical Chinese. Leveraging the proposed C$^{3}$Bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also yield several findings. Specifically, existing LLMs struggle with CCU tasks and remain inferior to supervised models. Additionally, the results indicate that CCU is a task requiring special attention. We believe this study provides a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at \url{https://github.com/SCUT-DLVCLab/C3bench}.
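Since the benchmark packages each of the five tasks as input-output text pairs, evaluating a model amounts to prompting it on each pair and scoring the response per task. The sketch below illustrates such a loop; the file names, JSON fields, `ask_llm()` placeholder, and exact-match scoring are assumptions for illustration only, not the benchmark's actual data format or metrics (the official pipeline is in the linked repository, and tasks such as translation would normally use metrics like BLEU rather than exact match).

```python
# Minimal sketch of an evaluation loop over the five C^3Bench tasks.
# NOTE: file names, JSON fields, and ask_llm() are hypothetical; see
# https://github.com/SCUT-DLVCLab/C3bench for the real pipeline.
import json
from pathlib import Path

TASKS = ["classification", "retrieval", "ner", "punctuation", "translation"]


def ask_llm(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError


def exact_match(pred: str, gold: str) -> float:
    """Crude per-item score; real tasks use task-specific metrics (e.g. BLEU, F1)."""
    return float(pred.strip() == gold.strip())


def evaluate(data_dir: str = "data") -> dict:
    scores = {}
    for task in TASKS:
        # Assumed layout: one JSON-lines file per task with
        # "instruction", "input", and "output" fields.
        lines = Path(data_dir, f"{task}.jsonl").read_text(encoding="utf-8").splitlines()
        records = [json.loads(line) for line in lines if line.strip()]
        per_item = [
            exact_match(ask_llm(rec["instruction"] + rec["input"]), rec["output"])
            for rec in records
        ]
        scores[task] = sum(per_item) / len(per_item)
    return scores
```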
Authors: Jiahuan Cao, Yongxin Shi, Dezhi Peng, Yang Liu, Lianwen Jin