AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models (2403.06574v1)
Abstract: Given the importance of ancient Chinese in capturing the essence of rich historical and cultural heritage, the rapid advancement of LLMs necessitates benchmarks that can effectively evaluate their understanding of ancient contexts. To meet this need, we present AC-EVAL, an innovative benchmark designed to assess the advanced knowledge and reasoning capabilities of LLMs in the context of ancient Chinese. AC-EVAL is structured across three levels of difficulty reflecting different facets of language comprehension: general historical knowledge, short text understanding, and long text comprehension. The benchmark comprises 13 tasks spanning historical facts, geography, social customs, art, philosophy, and classical poetry and prose, providing a comprehensive assessment framework. Our extensive evaluation of top-performing LLMs, tailored for both English and Chinese, reveals substantial room for improvement in ancient text comprehension. By highlighting the strengths and weaknesses of LLMs, AC-EVAL aims to advance their development and application in ancient Chinese language education and scholarly research. The AC-EVAL data and evaluation code are available at https://github.com/yuting-wei/AC-EVAL.
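Benchmarks of this kind are typically scored by comparing a model's chosen option against a gold label per item and reporting accuracy per task and difficulty level. The sketch below illustrates that scoring scheme; the item fields, labels, and toy questions are illustrative assumptions, not the official AC-EVAL data format.

```python
# Hypothetical sketch of multiple-choice benchmark scoring, assuming
# each item carries a question, labeled options, and a gold answer.
# This is NOT the official AC-EVAL schema, only a plausible shape.
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    options: dict  # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str    # gold label, e.g. "A"


def accuracy(items, predictions):
    """Fraction of predicted labels that match the gold labels."""
    correct = sum(pred == item.answer for pred, item in zip(predictions, items))
    return correct / len(items)


# Toy two-item task (placeholder questions, not drawn from AC-EVAL).
items = [
    Item("Placeholder question 1", {"A": "x", "B": "y", "C": "z", "D": "w"}, "A"),
    Item("Placeholder question 2", {"A": "x", "B": "y", "C": "z", "D": "w"}, "B"),
]
print(accuracy(items, ["A", "D"]))  # one of two correct -> 0.5
```

In practice a harness would build a prompt from each item's question and options, parse the model's reply into a label, and aggregate accuracy separately for each of the benchmark's three difficulty levels.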
- Yuting Wei
- Yuanxing Xu
- Xinru Wei
- Simin Yang
- Yangfu Zhu
- Yuqing Li
- Di Liu
- Bin Wu