Measuring Taiwanese Mandarin Language Understanding (2403.20180v1)
Abstract: The evaluation of LLMs has recently drawn substantial attention in the field. This work focuses on evaluating LLMs in a Chinese context, specifically for Traditional Chinese, which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suite tailored for assessing advanced knowledge and reasoning capabilities of LLMs in the context of Taiwanese Mandarin. TMLU comprises 37 subjects spanning social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle-school to professional level. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models perform worse than multilingual proprietary ones, and that open-weight models tailored for Taiwanese Mandarin lag behind their Simplified-Chinese counterparts. The findings indicate substantial headroom for improvement and underscore the goal of TMLU: to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts to the community to promote future research.
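As a rough illustration of how a TMLU-style evaluation might run, the sketch below assembles a few-shot chain-of-thought prompt for a multiple-choice item and scores the parsed answer letter. The exemplar content, the field names (`question`, `choices`, `answer`), and the `generate` callable are illustrative assumptions for this sketch, not the interface of the released evaluation scripts.

```python
import re

# Hypothetical few-shot exemplar with a chain-of-thought-style explanation,
# mirroring the curated per-subject explanations described in the paper.
FEW_SHOT = [
    {
        "question": "水的化學式是什麼？",  # "What is the chemical formula of water?"
        "choices": ["A. CO2", "B. H2O", "C. NaCl", "D. O2"],
        "explanation": "水分子由兩個氫原子與一個氧原子組成。",  # two H atoms, one O atom
        "answer": "B",
    },
]

def build_prompt(example, shots=FEW_SHOT):
    """Assemble a few-shot CoT prompt for one multiple-choice item."""
    parts = []
    for shot in shots:
        parts.append(shot["question"])
        parts.extend(shot["choices"])
        parts.append(f"解釋：{shot['explanation']}")   # explanation
        parts.append(f"答案：{shot['answer']}\n")      # answer letter
    parts.append(example["question"])
    parts.extend(example["choices"])
    parts.append("解釋：")  # let the model reason before answering
    return "\n".join(parts)

def parse_choice(completion):
    """Extract the first A-D letter following an answer marker, if any."""
    match = re.search(r"答案[:：]\s*([ABCD])", completion)
    return match.group(1) if match else None

def accuracy(examples, generate):
    """`generate` is a stand-in for any LLM completion call."""
    correct = 0
    for ex in examples:
        pred = parse_choice(generate(build_prompt(ex)))
        correct += int(pred == ex["answer"])
    return correct / len(examples)
```

In practice, `generate` would wrap an actual model call, and the loop would run per subject across the 37 splits before averaging, as the paper does for its 24 baseline models.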
Authors: Po-Heng Chen, Sijia Cheng, Wei-Lin Chen, Yen-Ting Lin, Yun-Nung Chen