Abstract

As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. To address this, we introduce CMMLU, a comprehensive Chinese benchmark that covers a wide range of subjects, including the natural sciences, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual and Chinese-oriented LLMs, assessing their performance across subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors affecting model performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
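To make the 25% random baseline concrete: CMMLU-style benchmarks score models by exact-match accuracy over four-choice (A–D) questions, so uniform guessing converges to roughly one in four correct. The sketch below is an illustrative scorer, not the paper's evaluation harness; the function name and data are hypothetical.

```python
import random

def mcq_accuracy(predictions, answers):
    """Exact-match accuracy over four-choice (A-D) questions.

    Illustrative scorer for a CMMLU-style benchmark; not the
    authors' actual evaluation code.
    """
    assert len(predictions) == len(answers) and answers, "need paired, non-empty lists"
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

if __name__ == "__main__":
    # Hypothetical answer key; a uniform random guesser lands near 0.25.
    answers = [random.choice("ABCD") for _ in range(10_000)]
    guesses = [random.choice("ABCD") for _ in range(10_000)]
    print(mcq_accuracy(guesses, answers))  # typically a value near 0.25
```

A model scoring 50% therefore answers only twice as many questions correctly as chance, which is why the paper frames that threshold as substantial headroom.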
