Evaluation of Chinese LLM Factuality with Chinese SimpleQA
The paper "Chinese SimpleQA" introduces a novel benchmark intended for evaluating the factuality of LLMs in the Chinese language context. With the rapid advancement of LLMs, linguistic breadth remains crucial for comprehensive evaluation benchmarks. Following the precedent set by the SimpleQA framework, this new benchmark advances this need by focusing on Chinese rather than English. Chinese SimpleQA emerges as a critical tool to probe the factual accuracy of LLMs when addressing fact-based, short-form questions across a diverse array of topics and subtopics.
Dataset Composition and Quality Assurance
Chinese SimpleQA is constructed as a comprehensive dataset of 3,000 high-quality Q&A pairs spanning six principal topics, which are further divided into 99 fine-grained subtopics. The topics are "Chinese Culture," "Humanities," "Engineering, Technology, and Applied Sciences (ETAS)," "Life, Art, and Culture," "Society," and "Natural Science." A primary design goal is diversity and coverage, enabling assessment of LLMs across varied domains of knowledge.
In terms of content and data quality, the benchmark was meticulously curated to eliminate ambiguity and subjectivity, ensuring that every question has a definitive, single correct answer. The verification process includes dual annotation, strict validation against authoritative sources, and additional checks with external retrieval tools to confirm the factuality of candidate answers, leveraging approaches such as Retrieval-Augmented Generation (RAG).
Evaluation Metrics and Baseline Results
Evaluation of LLMs on Chinese SimpleQA follows straightforward, measurable criteria: "Correct" (CO), "Not Attempted" (NA), "Incorrect" (IN), "Correct Given Attempted" (CGA), and the F-score, which together summarize a model's factuality performance. Because the questions and reference answers are short, evaluation demands minimal computational resources, keeping the assessment efficient.
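A minimal sketch of how these metrics can be aggregated from per-question grades is shown below; the grade labels and the harmonic-mean F-score follow the convention of the original SimpleQA setup, while the function and variable names are illustrative rather than taken from the paper.

```python
from collections import Counter

def score_run(grades):
    """Aggregate per-question grades into Chinese SimpleQA-style metrics.

    `grades` is a list of strings, one per question, each being
    "correct", "incorrect", or "not_attempted".
    """
    counts = Counter(grades)
    total = len(grades)
    co = counts["correct"] / total            # Correct (CO)
    na = counts["not_attempted"] / total      # Not Attempted (NA)
    inc = counts["incorrect"] / total         # Incorrect (IN)
    attempted = counts["correct"] + counts["incorrect"]
    # Correct Given Attempted (CGA): precision over questions the model tried
    cga = counts["correct"] / attempted if attempted else 0.0
    # F-score: harmonic mean of overall correctness and correctness when attempted
    f_score = 2 * co * cga / (co + cga) if (co + cga) else 0.0
    return {"CO": co, "NA": na, "IN": inc, "CGA": cga, "F": f_score}

# Example: 3 correct, 1 incorrect, 1 not attempted
print(score_run(["correct", "correct", "incorrect", "not_attempted", "correct"]))
```

Note that the F-score rewards models that both answer many questions correctly overall and are accurate when they do attempt an answer, so blanket refusal does not score well.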
The baseline results highlight how different models handle fact-based queries, with notable models such as o1-preview and Doubao-pro-32k near parity at the top of the leaderboard for correct factual responses. Performance nonetheless varies widely across subtopics, pointing to model-specific strengths and weaknesses that merit further exploration.
Key Observations and Technological Implications
Several intriguing findings were noted from the evaluation on the Chinese SimpleQA benchmark:
- Model Size Correlates with Performance: Larger models tend to perform better, evidenced by consistent gains as model size increases within series like Qwen2.5 and InternLM. This trend aligns with the expectation that larger models internalize factual knowledge more comprehensively.
- Language-Specific Performance Variance: The paper shows that models developed with a focus on Chinese data and cultural context, such as Doubao-pro-32k, outperform general-purpose models such as GPT-4 on questions about Chinese content, reflecting the advantage of linguistic and cultural specialization.
- The Role of RAG: Integrating retrieval-based strategies significantly improves model performance, indicating that access to external knowledge sources can substantially boost LLM factual accuracy (a minimal sketch of such a pipeline follows this list). This observation points to real-time information retrieval as a practical pathway for strengthening LLM fact-checking.
- Alignment Tax: The drop in factuality observed after alignment, the so-called "alignment tax," draws attention to the trade-off between fine-tuning models for human-aligned outputs and preserving raw factual performance, suggesting current post-training approaches can inadvertently hinder factual capabilities.
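The sketch below illustrates the kind of retrieval-augmented setup the RAG finding refers to: retrieved passages are prepended to the question so the model can ground its short answer in external evidence rather than parametric memory alone. The `search` and `generate` functions are placeholders standing in for a real retriever and the model under evaluation; neither is an API from the paper.

```python
def search(query, k=3):
    """Placeholder retriever: in practice this would query a search engine
    or vector index and return the top-k relevant passages."""
    return [f"[passage {i} relevant to: {query}]" for i in range(1, k + 1)]

def generate(prompt):
    """Placeholder LLM call: in practice this would call the model under evaluation."""
    return "<model answer>"

def answer_with_rag(question):
    """Build a prompt that includes retrieved evidence, then generate a short answer."""
    passages = search(question)
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer the question concisely using the reference passages below.\n"
        f"References:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)

# A Chinese SimpleQA-style short factual question (illustrative)
print(answer_with_rag("《红楼梦》的作者是谁？"))
```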
Conclusions and Future Directions
Chinese SimpleQA stands as a pioneering benchmark for measuring the factual precision of LLMs in the Chinese-language domain. Its introduction not only fills a gap in language-specific model evaluation but also lays the groundwork for further research into improving factuality across multilingual AI systems. The dataset's proposed extension to other languages and modalities could further enrich model evaluation and provide a stronger basis for cross-cultural AI development. Improving factual alignment during post-training and reducing the alignment tax while maintaining ethical, human-aligned behavior remains a fertile avenue for ongoing research.