Evaluation of Traditional Chinese LLMs: A Comprehensive Benchmark Approach
In this paper, the authors address the critical need for robust benchmarks to evaluate the performance of LLMs in Traditional Chinese. While numerous benchmarks exist for English LLMs, Traditional Chinese lacks a comparably comprehensive evaluation framework. To bridge this gap, the paper introduces a suite of benchmarks tailored to Traditional Chinese that assesses a broad range of capabilities, including contextual question answering, world knowledge, summarization, classification, and table understanding.
Benchmark Design and Implementation
The benchmark suite combines existing Traditional Chinese datasets, a newly constructed dataset, and English datasets translated into Traditional Chinese where necessary. Existing resources such as the Delta Reading Comprehension Dataset (DRCD) and Taiwanese Trivia Question Answering (TTQA) cover contextual QA and world knowledge. The novel Taiwan Massive Multitask Language Understanding (TMMLU) dataset, built from Taiwanese educational exams, evaluates a model's competency across 55 subjects. Classification and summarization tasks use Traditional Chinese translations of English benchmarks such as IMDB and XSum.
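As a rough illustration of how such a multi-task suite can be driven, the sketch below wires a handful of tasks into a single evaluation loop. All names here (the `Task` container, the `exact_match` metric, the `generate` callable) are hypothetical placeholders for this summary, not the paper's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str                                 # e.g. "DRCD", "TTQA", "TMMLU", "XSum-TC"
    examples: List[dict]                      # each example holds a prompt and a reference
    metric: Callable[[str, str], float]       # (prediction, reference) -> score


def exact_match(prediction: str, reference: str) -> float:
    """Simple metric for QA- and classification-style tasks."""
    return float(prediction.strip() == reference.strip())


def evaluate(tasks: List[Task], generate: Callable[[str], str]) -> Dict[str, float]:
    """Run a model (wrapped as a prompt -> text callable) over every task and average per task."""
    results = {}
    for task in tasks:
        scores = [task.metric(generate(ex["prompt"]), ex["reference"])
                  for ex in task.examples]
        results[task.name] = sum(scores) / len(scores) if scores else 0.0
    return results


if __name__ == "__main__":
    # Toy usage with a trivial stand-in model; a real run would plug in an LLM call.
    toy_task = Task(
        name="TTQA-toy",
        examples=[{"prompt": "台灣最高的山是哪一座？", "reference": "玉山"}],
        metric=exact_match,
    )
    stub_model = lambda prompt: "玉山"          # placeholder for a real model
    print(evaluate([toy_task], stub_model))     # {'TTQA-toy': 1.0}
```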
Evaluation and Numerical Insights
Several models, including GPT-3.5, Taiwan-LLaMa-v1.0, and a proprietary model series, Model 7-C, were evaluated on the proposed benchmarks. GPT-3.5 consistently achieved the strongest results across the evaluated tasks, setting a high bar for Traditional Chinese models. Notably, Model 7-C performed comparably to GPT-3.5 on specific benchmarks such as DRCD and XSum-TC, demonstrating competitive capabilities in contextual question answering and summarization.
A crucial observation from the evaluations is that open-source models perform poorly on table understanding, frequently hallucinating content that is not in the source tables. They also consistently underperform on summarization, which the authors attribute to outputs whose structure deviates from the target summaries.
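One way to see why structural deviation hurts summarization scores is to run a reference and two faithful but differently shaped candidate summaries through an overlap-based metric. The paper's exact metric is not specified in this summary; the character-level ROUGE-L below is an assumed stand-in, chosen because characters are a reasonable unit for Chinese text.

```python
# Character-level ROUGE-L F1, used only to illustrate how overlap-based scoring
# penalizes faithful summaries whose structure departs from the reference.
def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 over the longest common subsequence of characters."""
    m, n = len(prediction), len(reference)
    if m == 0 or n == 0:
        return 0.0
    # Dynamic-programming table for LCS length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if prediction[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)


# Reference: "The typhoon caused outages in the north; Taipower expects to
# restore power tonight." Both candidates are faithful to it.
reference = "颱風造成北部停電，台電預計今晚恢復供電。"
concise = "颱風導致北部停電，台電今晚可望復電。"      # similar length and shape
verbose = "根據報導，這次颱風帶來強風豪雨，造成北部多處停電，台電表示正在搶修，預計今晚之前恢復供電。"

print(round(rouge_l_f1(concise, reference), 2))   # structurally close -> higher F1
print(round(rouge_l_f1(verbose, reference), 2))   # longer, restructured -> lower F1
```

The longer candidate covers the reference almost completely (high recall) yet scores lower overall because its extra, restructured material drags precision down, which mirrors the kind of penalty the summarization results describe.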
Open-Ended Generation and Model Helpfulness
To assess the models' ability to generate helpful open-ended responses, the TAIDE-14 tasks served as a benchmark. Model 7-C-Chat, the chat-optimized variant of Model 7-C, achieved notable performance, occasionally surpassing GPT-3.5 in helpfulness and demonstrating strong Traditional Chinese text generation across diverse domains.
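The judging protocol behind these helpfulness comparisons is not detailed in this summary. The sketch below shows one common setup, pairwise comparison with a judge, purely as an assumed illustration; `call_judge_model` is a hypothetical placeholder for whatever judge (human or LLM) is used.

```python
from typing import Callable, List, Tuple

JUDGE_TEMPLATE = (
    "You are given a Traditional Chinese instruction and two responses.\n"
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response is more helpful? Answer with exactly 'A', 'B', or 'tie'."
)


def pairwise_win_rate(
    prompts: List[str],
    responses_a: List[str],
    responses_b: List[str],
    call_judge_model: Callable[[str], str],   # hypothetical judge call
) -> Tuple[float, float]:
    """Return (win rate of A, win rate of B); ties count toward neither model."""
    wins_a = wins_b = 0
    for prompt, a, b in zip(prompts, responses_a, responses_b):
        verdict = call_judge_model(
            JUDGE_TEMPLATE.format(instruction=prompt, a=a, b=b)
        ).strip().lower()
        if verdict == "a":
            wins_a += 1
        elif verdict == "b":
            wins_b += 1
    n = len(prompts)
    return wins_a / n, wins_b / n
```

In practice, the A/B order is usually swapped per prompt before aggregating win rates to control for position bias in the judge.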
Implications and Future Outlook
The development and open-sourcing of such benchmarks are pivotal for advancing research on Traditional Chinese LLMs. By providing a foundation for broad-spectrum evaluation, the paper not only benchmarks current models but also identifies areas for future improvement. Encouragingly, this work opens new avenues for research, inviting both academic and industry stakeholders to refine their models against these comprehensive standards.
This initiative underscores the importance of culturally and linguistically appropriate evaluation frameworks, offering a pathway for equitable progress across languages. The open-source release of these resources is expected to facilitate collaboration and foster innovation, driving the development of more sophisticated and inclusive AI systems in the future. This research could, therefore, serve as a catalyst for nuanced advancements in the field of Traditional Chinese NLP and beyond.