QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation (2505.05225v1)

Published 8 May 2025 in cs.CL

Abstract: The rapid advancement of Chinese LLMs underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, with data selections grounded in 24 Chinese qualifications to closely align with national policies and working standards. Through comprehensive evaluation, the Qwen2.5 model outperformed the more advanced GPT-4o, with Chinese LLMs consistently surpassing non-Chinese models, highlighting the importance of localized domain knowledge in meeting qualification requirements. The best performance of 75.26% reveals current gaps in domain coverage within model capabilities. Furthermore, we present the failure of LLM collaboration via crowdsourcing mechanisms and suggest opportunities for multi-domain RAG knowledge enhancement and vertical-domain LLM training with Federated Learning.

QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

The paper "QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation" addresses the critical need for domain-specific evaluation frameworks in assessing the performance of Chinese LLMs. As Chinese LLMs continue to evolve and integrate into various applications, it is essential to have standardized and relevant benchmarks that reflect their interactions within a local context. This paper introduces QualBench, a novel multi-domain Chinese QA benchmark specifically designed to evaluate LLMs against localized professional qualifications, thereby filling significant gaps in current benchmarking practices.

Core Contributions and Findings

The paper's primary contribution is a comprehensive dataset of over 17,000 questions derived from 24 professional qualification exams spanning six vertical domains: production safety, civil engineering, fire safety, oil and gas, economics and finance, and banking and insurance. This alignment with national standards grounds the benchmark in the Chinese working context and enhances the realism of the evaluation.
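
The paper does not prescribe a data format for these items; as a purely illustrative assumption, one might represent each benchmark question along the following lines (the field names are hypothetical, not the dataset's actual schema):

```python
from dataclasses import dataclass

@dataclass
class QualBenchItem:
    """One qualification-exam question (illustrative schema, not the official one)."""
    question: str            # question text, in Chinese
    options: dict[str, str]  # option key -> option text, e.g. {"A": "...", "B": "..."}
    answer: str              # gold option key, e.g. "B"
    domain: str              # one of the six vertical domains
    qualification: str       # one of the 24 source qualification exams
```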

Notably, the evaluation results reveal that localized models such as Qwen2.5 outperform more globally oriented counterparts like GPT-4o. The best model achieved an accuracy of 75.26%, a score that would only marginally pass a typical qualification exam and leaves substantial room for improvement in domain-specific capabilities.
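
For concreteness, a headline accuracy of this kind corresponds to simple exact-match scoring over such items. The sketch below assumes the illustrative schema above and a hypothetical `ask_model` callable that returns a single option letter:

```python
def accuracy(items: list[QualBenchItem], ask_model) -> float:
    """Fraction of questions whose predicted option key matches the gold key."""
    correct = 0
    for item in items:
        prediction = ask_model(item.question, item.options)  # e.g. returns "B"
        if prediction.strip().upper() == item.answer.upper():
            correct += 1
    return correct / len(items)

# A returned value of 0.7526 would correspond to the paper's best reported result.
```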

Furthermore, Chinese LLMs consistently surpass non-Chinese models, emphasizing the importance of localized domain knowledge for effective application within the Chinese environment. This observation underscores the necessity for benchmarks that not only account for linguistic capabilities but also reflect cultural and regulatory nuances that impact practical utility.

Additional Insights and Implications

The paper discusses the limitations of existing benchmarks, which predominantly focus on language capabilities without sufficient coverage of domain-specific knowledge. By integrating professional qualification exams into the evaluation framework, QualBench provides a more holistic approach that aligns model capabilities with human expertise and regulatory standards.

Moreover, the paper demonstrates the inadequacy of LLM collaboration mechanisms such as crowdsourced answer aggregation, which failed to outperform a single strong model. This negative result points to opportunities for aggregation techniques that better leverage multi-LLM responses.
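
The exact aggregation scheme is not detailed in this summary; the simplest crowdsourcing-style baseline is majority voting over independent model answers, sketched here as an assumption about what such a mechanism looks like:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the option key chosen by the most models (ties: first encountered)."""
    return Counter(a.strip().upper() for a in answers).most_common(1)[0][0]

print(majority_vote(["B", "B", "C"]))  # -> "B"
```

One plausible reading of the negative result is that equal-weight voting lets weaker models dilute the strongest one when domain expertise is unevenly distributed across the pool.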

Future Directions and Potential Developments

The authors propose two promising avenues for further research and development:

  1. Retrieval Augmented Generation (RAG): incorporating external knowledge bases to enhance LLM performance on domain-specific tasks. This approach involves constructing cross-domain knowledge graphs to support effective retrieval and contextually grounded answer generation (see the first sketch after this list).
  2. Federated Learning for Domain-Specific Training: harnessing private domain data through federated learning to address data scarcity and improve domain knowledge coverage without compromising data privacy or security (see the second sketch after this list).
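
Neither direction is specified at the implementation level in the summary. As a rough illustration of the first, the sketch below retrieves the passages most cosine-similar to a question from a hypothetical cross-domain corpus and prepends them to the prompt; `embed` and `ask_model` are assumed stand-ins, not an API from the paper:

```python
import numpy as np

def retrieve(question_vec: np.ndarray, passage_vecs: np.ndarray,
             passages: list[str], k: int = 3) -> list[str]:
    """Return the k passages whose embeddings are most cosine-similar to the question."""
    sims = passage_vecs @ question_vec / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-9
    )
    top = np.argsort(sims)[::-1][:k]
    return [passages[i] for i in top]

def rag_answer(item, embed, ask_model, passages, passage_vecs):
    """Ground the model's answer in retrieved domain passages (RAG)."""
    context = "\n".join(retrieve(embed(item.question), passage_vecs, passages))
    prompt = f"Reference material:\n{context}\n\nQuestion: {item.question}"
    return ask_model(prompt, item.options)
```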
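
For the second direction, the core idea of federated training is that institutions holding private exam or regulatory data fine-tune locally and share only model parameters. A minimal FedAvg-style aggregation, shown here for a single parameter vector per client, is a generic sketch rather than the authors' procedure:

```python
import numpy as np

def fed_avg(client_weights: list[np.ndarray], client_sizes: list[int]) -> np.ndarray:
    """Average client parameter vectors, weighted by local dataset size (FedAvg)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Per round: each client fine-tunes on its private domain data and sends updated
# weights; the server aggregates with fed_avg. Raw exam data never leaves clients.
```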

The paper represents a foundational effort to push the boundaries of Chinese LLM evaluation by addressing localization and domain coverage comprehensively. As LLMs continue to advance, QualBench provides a critical framework for assessing their readiness for deployment in specialized fields within China. Such benchmarks are vital for ensuring that AI systems not only understand linguistic intricacies but also embody the technical precision and contextual knowledge that real-world applications demand.

Authors (4)
  1. Mengze Hong (11 papers)
  2. Wailing Ng (3 papers)
  3. Di Jiang (42 papers)
  4. Chen Jason Zhang (25 papers)