CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models (2402.13109v2)
Abstract: The advancement of LLMs has enhanced the ability to generalize across a wide range of unseen NLP tasks through instruction-following. Yet, their effectiveness often diminishes in low-resource languages like Chinese, exacerbated by biased evaluations from data leakage, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance, totaling 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work not only uncovers the current limitations of LLMs in handling Chinese language tasks but also sets a new standard for future LLM generalizability research, pushing towards the development of more adaptable, culturally informed, and linguistically diverse models.
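The abstract's dataset arithmetic and its variance-reduction idea can be sketched briefly. In this hedged Python sketch, the per-task instance count (100) and the number of instruction phrasings per instance (3) are assumptions inferred from the stated totals (150 tasks, 15,000 pairs, 45,000 instances); the averaging function illustrates the general idea of smoothing out sensitivity to any single instruction wording, not the paper's exact scoring protocol.

```python
from statistics import mean

# Scale implied by the abstract's totals (per-task split is an assumption):
NUM_TASKS = 150
PAIRS_PER_TASK = 15_000 // NUM_TASKS      # 100 pairs per task, if split evenly
INSTRUCTION_VARIANTS = 45_000 // 15_000   # 3 diversified phrasings per pair

def score_over_variants(scores_per_variant):
    """Average a model's score across instruction phrasings so that the
    reported number is less sensitive to any one wording of the task."""
    return mean(scores_per_variant)

# Hypothetical example: one model's scores under three phrasings of a task.
per_variant_scores = [0.58, 0.49, 0.52]
print(PAIRS_PER_TASK, INSTRUCTION_VARIANTS)              # 100 3
print(round(score_over_variants(per_variant_scores), 2)) # 0.53
```

Averaging over diversified phrasings trades extra evaluation cost (3× the instances) for a more stable per-task score, which is the variance-reduction motivation the abstract names.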
Authors: Ge Zhang, Xingwei Qu, Jiali Li, Zhaoqun Li, Zekun Wang, Hao Li, Ruibin Yuan, Yinghao Ma, Kai Zhang, Wangchunshu Zhou, Yiming Liang, Lei Zhang, Lei Ma, Jiajun Zhang, Zuowen Li, Stephen W. Huang, Chenghua Lin, Jie Fu, Yizhi Li