
Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite (2309.08448v2)

Published 15 Sep 2023 in cs.CL

Abstract: The evaluation of LLMs is an essential task in the field of language understanding and generation. As LLMs continue to advance, the need for effective benchmarks to assess their performance has become imperative. In the context of Traditional Chinese, there is a scarcity of comprehensive and diverse benchmarks to evaluate the capabilities of LLMs, despite the existence of certain benchmarks such as DRCD, TTQA, CMDQA, and FGC dataset. To address this gap, we propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate LLMs in Traditional Chinese. These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding. The proposed benchmarks offer a comprehensive evaluation framework, enabling the assessment of LLMs' capabilities across different tasks. In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks. The evaluation results highlight that our model, Model 7-C, achieves performance comparable to GPT-3.5 with respect to a part of the evaluated capabilities. In an effort to advance the evaluation of LLMs in Traditional Chinese and stimulate further research in this field, we have open-sourced our benchmark and opened the model for trial.

Evaluation of Traditional Chinese LLMs: A Comprehensive Benchmark Approach

In this paper, the authors address the critical need for robust benchmarks to evaluate the performance of LLMs in Traditional Chinese. While numerous benchmarks are available for English LLMs, Traditional Chinese lacks comparably comprehensive evaluation frameworks. To bridge this gap, the paper introduces a suite of benchmarks, tailored to Traditional Chinese, that assesses a broad range of LLM capabilities, including contextual question answering, world-knowledge evaluation, summarization, classification, and table understanding.

Benchmark Design and Implementation

The benchmarks proposed in this paper are largely adapted from existing English datasets, translated into Traditional Chinese where necessary. Existing Traditional Chinese datasets, such as the Delta Reading Comprehension Dataset (DRCD) and Taiwanese Trivia Question Answering (TTQA), are used for contextual QA and world-knowledge tasks. A novel dataset, Taiwan Massive Multitask Language Understanding (TMMLU), is introduced to evaluate a model's competency across 55 subjects, drawing on educational exams from Taiwan. Classification and summarization tasks use datasets translated from English benchmarks such as IMDB and XSum.
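To make the setup concrete, the sketch below shows how accuracy might be computed on a TMMLU-style multiple-choice task. This is a minimal illustration, not the paper's released harness: the JSONL field names (`question`, `choices`, `answer`) and the `query_model` stub are assumptions.

```python
import json

def query_model(prompt: str) -> str:
    """Stub for an LLM call; swap in a real API or local model."""
    raise NotImplementedError

def evaluate_multiple_choice(path: str) -> float:
    """Score a TMMLU-style JSONL file by exact match on the answer letter.

    Assumed (hypothetical) record format:
    {"question": "...", "choices": ["...", "...", "...", "..."], "answer": "B"}
    """
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            labels = "ABCD"[: len(item["choices"])]
            options = "\n".join(
                f"{label}. {choice}"
                for label, choice in zip(labels, item["choices"])
            )
            prompt = f"{item['question']}\n{options}\n答案："
            prediction = query_model(prompt).strip()[:1]  # keep first letter only
            correct += prediction == item["answer"]
            total += 1
    return correct / total if total else 0.0
```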

Evaluation and Numerical Insights

The performance of several models, including GPT-3.5, Taiwan-LLaMa-v1.0, and the proprietary Model 7-C series, was evaluated on the proposed benchmarks. The results show that GPT-3.5 consistently achieves superior performance across the evaluated tasks, setting a high standard for Traditional Chinese models. Notably, Model 7-C demonstrated effectiveness comparable to GPT-3.5 on specific benchmarks such as DRCD and XSum-TC, showcasing competitive capabilities in contextual question answering and summarization.

A crucial observation from the evaluations is the weakness of open-source models on table understanding tasks, where they exhibit significant hallucination. They also consistently underperform on summarization, which the authors attribute to outputs whose structure deviates from that of the target summaries.
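This summarization failure mode is easy to see with a standard overlap metric. The sketch below uses the `rouge-score` package (ROUGE here is an assumption; this summary does not state the paper's exact metric) to show how an answer that carries the right facts but deviates structurally from the single-sentence XSum reference style scores far lower than a concise summary.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

# Reference in the terse, single-sentence XSum style.
reference = "The council approved the new rail link on Tuesday."

concise = "The council approved a new rail link on Tuesday."
# Structurally deviant output: correct facts wrapped in list boilerplate.
verbose = (
    "Summary: Here are the key points. 1. A council meeting took place. "
    "2. Members voted in favour. 3. The new rail link will go ahead."
)

for name, candidate in [("concise", concise), ("verbose", verbose)]:
    scores = scorer.score(reference, candidate)
    print(name, {k: round(v.fmeasure, 3) for k, v in scores.items()})
```

Note that for Traditional Chinese text the package's default English-oriented tokenizer is inadequate; character-level segmentation is a common workaround.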

Open-Ended Generation and Model Helpfulness

To assess the models' ability to generate helpful responses, the TAIDE-14 tasks served as the benchmark. Model 7-C-Chat, the chat-optimized variant of Model 7-C, achieved notable performance, occasionally surpassing GPT-3.5 in helpfulness and demonstrating proficiency in Traditional Chinese text generation across diverse domains.
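The judging protocol for TAIDE-14 is not detailed in this summary, but pairwise helpfulness comparisons of this kind are commonly run with an LLM judge. The following is a generic sketch of that pattern, with a hypothetical `judge` callable rather than the paper's actual setup:

```python
def pairwise_helpfulness(judge, task: str, response_a: str, response_b: str) -> str:
    """Ask a judge model which of two responses is more helpful.

    `judge` is any callable mapping a prompt string to a model reply.
    The prompt template and verdict parsing are illustrative only,
    not the protocol used in the paper.
    """
    prompt = (
        "You are comparing two responses to the same task.\n"
        f"Task: {task}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is more helpful? Answer with exactly 'A', 'B', or 'tie'."
    )
    verdict = judge(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

In practice, each pair is usually judged twice with the response order swapped to control for position bias.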

Implications and Future Outlook

The development and open-sourcing of such benchmarks are pivotal for advancing research on Traditional Chinese LLMs. By providing a foundation for broad-spectrum evaluation, the paper not only benchmarks current models but also identifies areas for future improvement. Encouragingly, this work opens new avenues for research, inviting both academic and industry stakeholders to refine their models against these comprehensive standards.

This initiative underscores the importance of culturally and linguistically appropriate evaluation frameworks, offering a pathway for equitable progress across languages. The open-source release of these resources is expected to facilitate collaboration and foster innovation, driving the development of more sophisticated and inclusive AI systems in the future. This research could, therefore, serve as a catalyst for nuanced advancements in the field of Traditional Chinese NLP and beyond.

Authors (6)
  1. Chan-Jan Hsu (16 papers)
  2. Chang-Le Liu (10 papers)
  3. Feng-Ting Liao (8 papers)
  4. Yi-Chang Chen (14 papers)
  5. Da-shan Shiu (27 papers)
  6. Po-chun Hsu (25 papers)
Citations (10)