
Chinese Hard Case Benchmark (CHC-Bench#1)

Updated 7 April 2026
  • CHC-Bench#1 is a rigorous evaluation suite that tests LLMs on complex, culturally and historically nuanced Chinese language tasks.
  • It incorporates Gaokao-level STEM problems, advanced linguistic challenges, and diverse domains including writing, role-playing, and coding.
  • The benchmark employs a multi-dimensional GPT-4 scoring rubric across six axes to provide detailed performance insights for specialized Chinese tasks.

The Chinese Hard Case Benchmark (CHC-Bench#1) is a multidisciplinary, rigorously filtered evaluation suite created to assess the advanced Chinese language capabilities of LLMs. Unlike general-purpose benchmarks, CHC-Bench#1 was explicitly designed to test LLMs on “hard cases”: tasks demanding deep knowledge of Chinese culture, advanced STEM competence at the level of China’s national college exams (Gaokao), and the ability to navigate linguistic detail and nuance that remain intractable to non-specialized models. Its construction was guided by criteria of cultural/historical specificity, pedagogical rigor, and fine-grained linguistic challenge, resulting in a testbed suitable for the detailed quantification and comparison of Chinese-centric LLMs (Du et al., 2024).

1. Benchmark Foundations and Motivations

CHC-Bench#1 emerged to address the critical gap in LLM evaluation for complex, high-level Chinese-language understanding—a gap not filled by benchmarks recycled from English or dominated by simple Q&A. Its founding principles are:

  • Cultural and historical specificity: Problems requiring familiarity with classical and modern Chinese literature, historical events, idioms, and regulated poetic forms.
  • Pedagogical rigor: Inclusion of authentic items from the Chinese Gaokao, targeting upper-bound high-school mathematics, physics, chemistry, and biology.
  • Linguistic nuance: Coverage of hard Chinese linguistic phenomena, such as ancient grammar, pinyin/tone disambiguation, and contemporary internet slang.

Tasks were intentionally curated to push LLMs beyond routine capability. Every instance counts as “hard” for at least one of three reasons: its provenance (e.g., national exams), low pilot performance by 2B-parameter models, or its specialized nature (Du et al., 2024).

2. Dataset Composition and Curation

The dataset underlying CHC-Bench#1 comprises 214 single-turn problem instances distributed across eight top-level categories. Each prompt typically ranges from 30 to 100 Chinese characters, yielding a total corpus of approximately 20,000–30,000 characters. No synthetic prompts were used; all items originated from authoritative or curated sources and underwent manual filtering for challenge and instructional format. The dataset breakdown is presented below.

| Category | Subcategories | Total Questions |
| --- | --- | --- |
| Writing | Official docs, Advertisements, Poetry, Creative | 33 |
| Humanities | Historical common sense, Geography, History | 20 |
| Science | Physics, Chemistry, Biology (Gaokao) | 20 |
| Role-playing | 20 canonical characters (e.g., Sun Wukong, etc.) | 20 |
| Reading Comp. | Chinese (Gaokao), Info understanding, Argument | 30 |
| Math | Elementary/Middle, Gaokao, College-level | 34 |
| Hard Cases | Ancient Chinese, Pronunciation, Slang | 37 |
| Coding | Chinese command–driven code, Translation, Debug | 20 |
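
As a quick consistency check, the per-category counts in the table can be summed and turned into shares of the whole benchmark. The sketch below is illustrative only: the dictionary and function names are assumptions introduced here, and only the counts themselves come from the table.

```python
# Question counts per category, copied from the dataset table.
CATEGORY_COUNTS = {
    "Writing": 33,
    "Humanities": 20,
    "Science": 20,
    "Role-playing": 20,
    "Reading Comp.": 30,
    "Math": 34,
    "Hard Cases": 37,
    "Coding": 20,
}


def category_shares(counts):
    """Return each category's fraction of the whole benchmark."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_questions = sum(CATEGORY_COUNTS.values())
shares = category_shares(CATEGORY_COUNTS)

# Print categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14}{CATEGORY_COUNTS[name]:>4}{share:>8.1%}")
print(f"{'Total':<14}{total_questions:>4}")
```

The counts add up to the 214 instances stated for the benchmark, with Hard Cases as the largest single category.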

Candidate problems were sourced from:

  • Fengshenbang/Ziya (humanities)
  • Past Gaokao papers (language/STEM)
  • CIF-Bench and lexicons (slang, linguistic phenomena)

The inclusion criteria demanded that each item (a) require non-trivial Chinese knowledge, (b) fit an instruction-following paradigm, and (c) contribute to comprehensive coverage. Each entry was retained only if it satisfied a “hard-case” criterion (Du et al., 2024).

3. Evaluation Methodology and Metrics

CHC-Bench#1 eschews simplistic accuracy or $F_1$ measures for most tasks, instead adopting a multi-dimensional rubric inspired by MT-Bench and MT-Judge. The evaluation procedure is as follows:

  • Each model output is rated by GPT-4 on six axes: usefulness, relevance, accuracy, depth, creativity, and level of detail.
  • For a given item $i$, a score $r_i \in [1, 10]$ is assigned.
  • The model’s aggregate CHC-Bench score is the mean:

$$\mathrm{Score}_{\mathrm{model}} = \frac{1}{N} \sum_{i=1}^{N} r_i$$

with $N = 214$.
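
A minimal sketch of this aggregate, assuming the per-item judge scores are already available as a list of numbers. The function name, the optional length check, and the toy scores are hypothetical and not taken from the benchmark's released code.

```python
from statistics import mean


def chc_bench_score(item_scores, expected_n=None):
    """Unweighted mean of per-item judge scores, each expected in [1, 10]."""
    if not item_scores:
        raise ValueError("no item scores supplied")
    if expected_n is not None and len(item_scores) != expected_n:
        raise ValueError(f"expected {expected_n} scores, got {len(item_scores)}")
    for r in item_scores:
        if not 1 <= r <= 10:
            raise ValueError(f"score {r} is outside [1, 10]")
    return mean(item_scores)


# Made-up scores for four items, purely for illustration.
toy_scores = [7, 4, 9, 5]
print(chc_bench_score(toy_scores))
```

For a full run, `expected_n=214` would guard against missing items before the mean is taken.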

No weighting or penalty terms modify this mean. An item is retained in the benchmark only if it elicits an average score below 5.0 in pilot runs with off-the-shelf 2B-parameter models, or if it has “hard” cultural/linguistic content (Du et al., 2024).
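
The retention rule can be sketched as a simple filter. This is an illustration under stated assumptions: the `Candidate` record, its field names, and the manual `hard_content` flag are hypothetical; only the 5.0 threshold and the either/or logic come from the description above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

PILOT_THRESHOLD = 5.0  # average pilot score below which an item counts as hard


@dataclass
class Candidate:
    item_id: str
    pilot_scores: List[float]  # judge scores from pilot ~2B-parameter models
    hard_content: bool         # manually flagged cultural/linguistic difficulty


def is_retained(candidate, threshold=PILOT_THRESHOLD):
    """Keep an item if pilot models score low on it or it is flagged as hard."""
    low_pilot = bool(candidate.pilot_scores) and mean(candidate.pilot_scores) < threshold
    return low_pilot or candidate.hard_content


# Made-up candidates, purely for illustration.
pool = [
    Candidate("easy-qa", [8.0, 7.5], hard_content=False),
    Candidate("gaokao-math", [3.0, 4.5], hard_content=False),
    Candidate("slang", [6.0, 6.5], hard_content=True),
]
kept = [c.item_id for c in pool if is_retained(c)]
print(kept)
```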

Qualitative examples in the benchmark span poetic composition, advanced probability, chemistry conceptual knowledge, character role-play, and modern internet slang explanation.

4. Empirical Results and Model Performance

Results for leading ~2B-parameter models on CHC-Bench#1 demonstrate significant variance in Chinese-language capability:

| Model | Overall | Hard Cases | Social/Hum. | Coding | Writing | Roleplay | Math | Reading | Science |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CT-LLM#1 | 3.99 | 3.05 | 5.00 | 4.05 | 4.55 | 4.10 | 3.21 | 4.93 | 3.50 |
| MiniCPM-2B | 6.95 | 6.81 | 7.30 | 8.55 | 9.00 | 7.05 | 5.18 | 6.33 | 5.70 |
| Deepseek-coder-1.3B | 3.03 | 1.92 | 2.05 | 6.70 | 3.09 | 2.60 | 2.21 | 4.73 | 1.60 |
| Other (Bloom, etc.) | 1.40–3.31 | 1.24–3.16 | 1.35–4.60 | 1.00–2.70 | 1.09–3.36 | 1.35–3.75 | 1.15–3.12 | 2.43–5.47 | 1.40–2.75 |
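
Because the aggregate is an unweighted mean over items, each Overall value should equal the question-count-weighted average of that model's category scores. The sketch below checks this against the table. It assumes that the column labels "Social/Hum.", "Roleplay", and "Reading" correspond to the Humanities, Role-playing, and Reading Comp. categories of Section 2; the variable and function names are illustrative.

```python
# Question counts per category (Section 2), keyed by the results-table labels.
COUNTS = {
    "Hard Cases": 37, "Social/Hum.": 20, "Coding": 20, "Writing": 33,
    "Roleplay": 20, "Math": 34, "Reading": 30, "Science": 20,
}

# Per-category scores copied from the results table.
CATEGORY_SCORES = {
    "CT-LLM#1": {
        "Hard Cases": 3.05, "Social/Hum.": 5.00, "Coding": 4.05, "Writing": 4.55,
        "Roleplay": 4.10, "Math": 3.21, "Reading": 4.93, "Science": 3.50,
    },
    "MiniCPM-2B": {
        "Hard Cases": 6.81, "Social/Hum.": 7.30, "Coding": 8.55, "Writing": 9.00,
        "Roleplay": 7.05, "Math": 5.18, "Reading": 6.33, "Science": 5.70,
    },
    "Deepseek-coder-1.3B": {
        "Hard Cases": 1.92, "Social/Hum.": 2.05, "Coding": 6.70, "Writing": 3.09,
        "Roleplay": 2.60, "Math": 2.21, "Reading": 4.73, "Science": 1.60,
    },
}

REPORTED_OVERALL = {"CT-LLM#1": 3.99, "MiniCPM-2B": 6.95, "Deepseek-coder-1.3B": 3.03}


def weighted_overall(scores, counts):
    """Average category scores, weighting each by its number of questions."""
    total = sum(counts.values())
    return sum(scores[cat] * n for cat, n in counts.items()) / total


for model, scores in CATEGORY_SCORES.items():
    recomputed = weighted_overall(scores, COUNTS)
    print(f"{model:<22}recomputed={recomputed:.2f}  reported={REPORTED_OVERALL[model]:.2f}")
```

Under that label mapping, the recomputed values agree with the reported Overall column to two decimal places for all three models.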

CT-LLM#1, developed with 800B Chinese tokens, outperforms Bloom, Gemma, and TinyLlama and is described as close to Stablelm. Against Deepseek-coder-1.3B it scores higher overall (3.99 vs. 3.03) but lower on coding (4.05 vs. 6.70). CT-LLM#1 is strongest on social/humanities (5.00), reading (4.93), and writing (4.55) tasks; it is weakest on hard-case language instances (3.05) and Gaokao-level STEM (math 3.21, science 3.50). The higher CHC-Bench scores of models such as MiniCPM-2B may reflect broader multilingual instruction tuning rather than pure Chinese skill (Du et al., 2024).

No formal significance testing is reported, but the gap of roughly three points between CT-LLM#1 (3.99) and the top-scoring MiniCPM-2B (6.95) suggests that further targeted Chinese instruction tuning is needed for parity at this scale.

5. Representative Task Types and Examples

CHC-Bench#1’s task types are drawn from diverse, high-difficulty sources:

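  • Writing (Poetry/Couplet): Prompt: “以‘夏至’为节气写一副对联。” (“Write a couplet for the solar term Summer Solstice.”) Reference answer: “上联 夏至苍穹云影薄,下联 清风晓起露华浓。” (上联 and 下联 mark the first and second lines of the couplet.)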
  • Math (Gaokao-level Probability): Prompt: $P(\text{good air quality, 1 day}) = 0.75$, $P(\text{good 2 days}) = 0.60$. Given today is good, $P(\text{tomorrow is good})$? Reference: $0.60/0.75 = 0.80$.
  • Science (Chemistry): Prompt: “Which disinfectant’s active ingredient is a salt?...” Reference: “Potassium permanganate is a salt.”
  • Role-play: Prompt to emulate canonical figures such as Tang Sanzang with first-person responses referencing classical trial motifs.
  • Hard Cases: Explanation of current internet slang (“小镇做题家”, roughly “small-town test-taker”) for Chinese learners.
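
The reference answer to the probability item follows from the definition of conditional probability. The event names $A$ and $B$ are introduced here for illustration, and the derivation assumes that “good 2 days” means two consecutive days that are both good:

$$
\begin{aligned}
A &= \{\text{today is good}\}, \qquad B = \{\text{tomorrow is good}\},\\
P(A) &= 0.75, \qquad P(A \cap B) = 0.60,\\
P(B \mid A) &= \frac{P(A \cap B)}{P(A)} = \frac{0.60}{0.75} = 0.80.
\end{aligned}
$$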

This variety enforces broad coverage of advanced knowledge and deep structural understanding (Du et al., 2024).

6. Benchmark Evolution and Future Directions

The authors propose several enhancements to keep CHC-Bench#1 at the frontier of Chinese LLM evaluation:

  1. Expansion to Multi-turn/Conversational Subtasks: To test interactive instruction following beyond single-turn prompts.
  2. Increased Coverage of Advanced Gaokao Items: Including Chinese literature analysis and advanced calculus.
  3. Integration of Human-Annotated Reference Answers: Enables automated string-matching metrics (e.g., BLEU, ROUGE) alongside GPT-4 ratings for hybrid evaluation.
  4. Periodic Difficulty Calibration: Re-running pilot tests on the newest open-source Chinese LLMs ensures the benchmark continues to challenge state-of-the-art models.
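
One way the hybrid evaluation in item 3 might look is sketched below. It is not the authors' method: the character-level ROUGE-L F-measure stands in for the string-matching metrics they mention, and the `hybrid_score` function, its rescaling of the judge score to $[0, 1]$, and the `judge_weight` parameter are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Character-level ROUGE-L F-measure (suits unsegmented Chinese text)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def hybrid_score(judge_score, candidate, reference, judge_weight=0.5):
    """Blend a judge score in [1, 10] with string overlap; result lies in [0, 1]."""
    judge_unit = (judge_score - 1) / 9  # rescale [1, 10] to [0, 1]
    overlap = rouge_l_f1(candidate, reference)
    return judge_weight * judge_unit + (1 - judge_weight) * overlap


# Toy strings, purely for illustration.
print(rouge_l_f1("abcde", "ace"))
print(hybrid_score(10, "abc", "abc"))
```

Working at the character level avoids a dependency on a Chinese word segmenter, at the cost of rewarding shared characters that do not form shared words.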

These strategies are aimed at maintaining the benchmark’s “hard” frontier as LLM capabilities progress (Du et al., 2024).

7. Significance and Impact

CHC-Bench#1 is positioned as a yardstick for quantifiable, cross-model comparison of advanced Chinese language ability in LLMs. It exposes weaknesses that multilingual or English-centric evaluations tend to obscure, particularly in cultural fidelity, STEM application, and linguistic sophistication. The explicit demonstration of CT-LLM#1’s strengths and gaps underscores the need for continued Chinese-specific tuning and hard-case data augmentation. A plausible implication is that benchmarks oriented towards real-world, high-stakes use cases can drive both architectural and data-centric advances in domain-specialized LLMs (Du et al., 2024).

By making the benchmark and evaluation procedure open, the developers facilitate ongoing, systematic progress tracking and model innovation for Chinese NLP research.
