Chinese Hard Case Benchmark (CHC-Bench#1)

Updated 7 April 2026

CHC-Bench#1 is a rigorous evaluation suite that tests LLMs on complex, culturally and historically nuanced Chinese language tasks.
It incorporates Gaokao-level STEM problems, advanced linguistic challenges, and diverse domains including writing, role-playing, and coding.
The benchmark employs a multi-dimensional GPT-4 scoring rubric across six axes to provide detailed performance insights for specialized Chinese tasks.

The Chinese Hard Case Benchmark (CHC-Bench#1) is a multidisciplinary, rigorously filtered evaluation suite created to assess the advanced Chinese language capabilities of LLMs. Unlike general-purpose benchmarks, CHC-Bench#1 was explicitly designed to test LLMs on “hard cases”: tasks demanding deep knowledge of Chinese culture, advanced STEM competence at the level of China’s national college exams (Gaokao), and the ability to navigate linguistic detail and nuance that remain intractable to non-specialized models. Its construction was guided by criteria of cultural/historical specificity, pedagogical rigor, and fine-grained linguistic challenge, resulting in a testbed suitable for the detailed quantification and comparison of Chinese-centric LLMs (Du et al., 2024).

1. Benchmark Foundations and Motivations

CHC-Bench#1 emerged to address the critical gap in LLM evaluation for complex, high-level Chinese-language understanding—a gap not filled by benchmarks recycled from English or dominated by simple Q&A. Its founding principles are:

Cultural and historical specificity: Problems requiring familiarity with classical and modern Chinese literature, historical events, idioms, and regulated poetic forms.
Pedagogical rigor: Inclusion of authentic items from the Chinese Gaokao, targeting upper-bound high-school mathematics, physics, chemistry, and biology.
Linguistic nuance: Coverage of hard Chinese linguistic phenomena, such as ancient grammar, pinyin/tone disambiguation, and contemporary internet slang.

Benchmark tasks were intentionally curated to push LLMs beyond routine capability. Every instance is considered “hard” for at least one of three reasons: its provenance (e.g., national exams), low pilot performance by 2B-parameter models, or its specialized nature (Du et al., 2024).

2. Dataset Composition and Curation

The dataset underlying CHC-Bench#1 comprises 214 single-turn problem instances distributed across eight top-level categories. Each prompt typically ranges from 30 to 100 Chinese characters, yielding a total corpus of approximately 20,000–30,000 characters. No synthetic prompts were used; all items originated from authoritative or curated sources and underwent manual filtering for challenge and instructional format. The dataset breakdown is presented below.

Category	Subcategories	Total Questions
Writing	Official docs, Advertisements, Poetry, Creative	33
Humanities	Historical common sense, Geography, History	20
Science	Physics, Chemistry, Biology (Gaokao)	20
Role-playing	20 canonical characters (e.g., Sun Wukong, etc.)	20
Reading Comp.	Chinese (Gaokao), Info understanding, Argument	30
Math	Elementary/Middle, Gaokao, College-level	34
Hard Cases	Ancient Chinese, Pronunciation, Slang	37
Coding	Chinese command–driven code, Translation, Debug	20
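As a quick consistency check, the per-category counts in the table can be summed to confirm the stated total of 214 items. The sketch below only transcribes the table; the name `CATEGORY_COUNTS` is an illustrative choice and not part of any released tooling.

```python
# Question counts transcribed from the category table above.
CATEGORY_COUNTS = {
    "Writing": 33,
    "Humanities": 20,
    "Science": 20,
    "Role-playing": 20,
    "Reading Comp.": 30,
    "Math": 34,
    "Hard Cases": 37,
    "Coding": 20,
}

total = sum(CATEGORY_COUNTS.values())
print(total)  # 214

# Share of the benchmark taken by each category, largest first.
for name, count in sorted(CATEGORY_COUNTS.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14} {count:>3}  {count / total:.1%}")
```

Hard Cases, Math, and Writing are the three largest categories, so they carry the most weight in any mean taken over all items.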

Candidate problems were sourced from:

Fengshenbang/Ziya (humanities)
Past Gaokao papers (language/STEM)
CIF-Bench and lexicons (slang, linguistic phenomena)

The inclusion criteria demanded that each item (a) require non-trivial Chinese knowledge, (b) fit an instruction-following paradigm, and (c) contribute to comprehensive coverage. Each entry was retained only if it satisfied a “hard-case” criterion (Du et al., 2024).

3. Evaluation Methodology and Metrics

CHC-Bench#1 eschews simplistic accuracy or $F_1$ measures for most tasks, instead adopting a multi-dimensional rubric inspired by MT-Bench and MT-Judge. The evaluation procedure is as follows:

Each model output is rated by GPT-4 on six axes: usefulness, relevance, accuracy, depth, creativity, and level of detail.
For a given item $i$, a score $r_i \in [1, 10]$ is assigned.
The model’s aggregate CHC-Bench score is the mean:

$\mathrm{Score}_{model} = \frac{1}{N} \sum_{i=1}^N r_i$

with $N = 214$.
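The aggregate above is a plain arithmetic mean, which can be sketched in a few lines. The function name `chc_bench_score` and the sample ratings are hypothetical and serve only to illustrate the formula; they are not taken from the benchmark's own code or data.

```python
def chc_bench_score(ratings):
    """Unweighted mean of per-item ratings r_i, each expected in [1, 10]."""
    if not ratings:
        raise ValueError("need at least one rating")
    for r in ratings:
        if not 1 <= r <= 10:
            raise ValueError(f"rating out of range: {r}")
    return sum(ratings) / len(ratings)


# Made-up ratings for four items; a real run would have N = 214 of them.
example_ratings = [3, 7, 5, 9]
print(chc_bench_score(example_ratings))  # 6.0
```

Because the mean is taken over items rather than over categories, larger categories influence the final score more than smaller ones.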

No weighting or penalty terms modify this mean. An item is retained in the benchmark only if it receives an average score below 5.0 in pilot runs with off-the-shelf 2B-parameter models, or if it has “hard” cultural/linguistic content (Du et al., 2024).
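The retention rule described above can be read as a simple filter over candidate items. The following is a minimal sketch under stated assumptions: the `Candidate` record, its field names, and the example items are all hypothetical, and the source does not say how the “hard” content flag was assigned beyond manual curation.

```python
from dataclasses import dataclass, field

PILOT_THRESHOLD = 5.0  # items averaging below this in pilot runs are kept


@dataclass
class Candidate:
    item_id: str                                       # hypothetical identifier
    pilot_scores: list = field(default_factory=list)   # pilot-run ratings (assumed field)
    hard_content: bool = False                         # manual "hard" flag (assumed field)


def is_retained(c):
    """Keep an item if its pilot average is below the threshold or it is flagged hard."""
    if c.hard_content:
        return True
    if not c.pilot_scores:
        return False  # no pilot evidence and no flag: nothing justifies keeping it
    return sum(c.pilot_scores) / len(c.pilot_scores) < PILOT_THRESHOLD


candidates = [
    Candidate("low-pilot", [3, 4]),
    Candidate("easy", [8, 9]),
    Candidate("easy-but-flagged", [8, 9], hard_content=True),
]
kept = [c.item_id for c in candidates if is_retained(c)]
print(kept)  # ['low-pilot', 'easy-but-flagged']
```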

Qualitative examples in the benchmark span poetic composition, advanced probability, chemistry conceptual knowledge, character role-play, and modern internet slang explanation.

4. Empirical Results and Model Performance

Results for leading ~2B-parameter models on CHC-Bench#1 show substantial variation in Chinese-language capability:

Model	Overall	Hard Cases	Social/Hum.	Coding	Writing	Roleplay	Math	Reading	Science
CT-LLM#1	3.99	3.05	5.00	4.05	4.55	4.10	3.21	4.93	3.50
MiniCPM-2B	6.95	6.81	7.30	8.55	9.00	7.05	5.18	6.33	5.70
Deepseek-coder-1.3B	3.03	1.92	2.05	6.70	3.09	2.60	2.21	4.73	1.60
Other (Bloom, etc.)	1.40–3.31	1.24–3.16	1.35–4.60	1.00–2.70	1.09–3.36	1.35–3.75	1.15–3.12	2.43–5.47	1.40–2.75

CT-LLM#1, developed with 800B Chinese tokens, outperforms Bloom, Gemma, and TinyLlama. By the overall scores in the table it also exceeds Deepseek-coder-1.3B (3.99 vs. 3.03), although Deepseek-coder-1.3B leads it on coding (6.70 vs. 4.05); MiniCPM-2B (6.95) is the only listed model that scores higher overall. CT-LLM#1 is strongest on social/humanities (5.00), reading (4.93), and writing (4.55) tasks, and weakest on hard-case language instances (3.05), math (3.21), and science (3.50). The higher CHC-Bench scores of models such as MiniCPM-2B may reflect broader multilingual instruction tuning rather than purely Chinese skill (Du et al., 2024).

No formal significance testing is reported, but the gap of nearly 3 points between CT-LLM#1 and MiniCPM-2B (3.99 vs. 6.95 overall) suggests that further targeted Chinese instruction tuning is needed for parity at this scale.
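One way to sanity-check the results table: if the Overall column is the mean over all 214 items, it should equal the per-category scores weighted by the category sizes from Section 2. The sketch below assumes that “Social/Hum.” corresponds to the Humanities category and “Reading” to Reading Comp.; that mapping is inferred from the two tables rather than stated in the source, and all variable names are illustrative.

```python
# Category sizes from the dataset table (Section 2).
COUNTS = {
    "Hard Cases": 37, "Social/Hum.": 20, "Coding": 20, "Writing": 33,
    "Roleplay": 20, "Math": 34, "Reading": 30, "Science": 20,
}

# Per-category scores transcribed from the results table.
SCORES = {
    "CT-LLM#1": [3.05, 5.00, 4.05, 4.55, 4.10, 3.21, 4.93, 3.50],
    "MiniCPM-2B": [6.81, 7.30, 8.55, 9.00, 7.05, 5.18, 6.33, 5.70],
    "Deepseek-coder-1.3B": [1.92, 2.05, 6.70, 3.09, 2.60, 2.21, 4.73, 1.60],
}


def weighted_overall(category_scores):
    """Mean of category scores weighted by the number of items per category."""
    weights = list(COUNTS.values())
    total = sum(w * s for w, s in zip(weights, category_scores))
    return total / sum(weights)


overall = {m: round(weighted_overall(s), 2) for m, s in SCORES.items()}
print(overall)
# {'CT-LLM#1': 3.99, 'MiniCPM-2B': 6.95, 'Deepseek-coder-1.3B': 3.03}

gap = round(overall["MiniCPM-2B"] - overall["CT-LLM#1"], 2)
print(gap)  # 2.96
```

The recomputed values match the Overall column for all three models, which is consistent with the score being an unweighted mean over items.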

5. Representative Task Types and Examples

CHC-Bench#1’s task types are drawn from diverse, high-difficulty sources:

Writing (Poetry/Couplet): Prompt: “以‘夏至’为节气写一副对联。” (“Write a couplet on the solar term ‘Summer Solstice’.”) Reference answer: “上联夏至苍穹云影薄，下联清风晓起露华浓。”
Math (Gaokao-level Probability): Prompt: $P(\text{good air quality on a given day}) = 0.75$ and $P(\text{good air quality on two consecutive days}) = 0.60$. Given that today’s air quality is good, what is $P(\text{tomorrow is good})$? Reference: $0.60 / 0.75 = 0.80$.
Science (Chemistry): Prompt: “Which disinfectant’s active ingredient is a salt?...” Reference: “Potassium permanganate is a salt.”
Role-play: Prompt to emulate canonical figures such as Tang Sanzang with first-person responses referencing classical trial motifs.
Hard Cases: Explanation of current internet slang (“小镇做题家”, roughly “small-town test-taker”) for Chinese learners.

This variety enforces broad coverage of advanced knowledge and deep structural understanding (Du et al., 2024).
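The reference answer to the Gaokao-level probability item follows from the definition of conditional probability, $P(B \mid A) = P(A \cap B) / P(A)$. A minimal sketch using exact fractions is shown here; the function and variable names are illustrative only.

```python
from fractions import Fraction


def conditional(p_joint, p_condition):
    """P(B | A) = P(A and B) / P(A), with A the conditioning event."""
    if p_condition == 0:
        raise ZeroDivisionError("conditioning event has probability zero")
    return p_joint / p_condition


p_today_good = Fraction(75, 100)  # P(good air quality on a given day)
p_both_good = Fraction(60, 100)   # P(good air quality on two consecutive days)

p_tomorrow_given_today = conditional(p_both_good, p_today_good)
print(p_tomorrow_given_today, float(p_tomorrow_given_today))  # 4/5 0.8
```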

6. Benchmark Evolution and Future Directions

The authors propose several enhancements to keep CHC-Bench#1 at the frontier of Chinese LLM evaluation:

Expansion to Multi-turn/Conversational Subtasks: To test interactive instruction following beyond single-turn prompts.
Increased Coverage of Advanced Gaokao Items: Including Chinese literature analysis and advanced calculus.
Integration of Human-Annotated Reference Answers: To enable automated n-gram overlap metrics (e.g., BLEU, ROUGE) alongside GPT-4 ratings for hybrid evaluation.
Periodic Difficulty Calibration: To re-run pilot tests on the newest open-source Chinese LLMs so that the benchmark continues to challenge state-of-the-art models.

These strategies are aimed at maintaining the benchmark’s “hard” frontier as LLM capabilities progress (Du et al., 2024).
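To make the idea of a reference-based metric concrete, the sketch below computes a character-level overlap F1 between a reference answer and a model output. This is a deliberately simplified stand-in, not BLEU or ROUGE, and the source does not specify which metric would be used or how it would be combined with GPT-4 ratings; the function name and the example strings are hypothetical.

```python
from collections import Counter


def char_overlap_f1(reference, hypothesis):
    """Character-level unigram F1; whitespace is ignored.

    Characters are used as units because Chinese text has no spaces
    between words; a real pipeline might segment words instead.
    """
    ref = Counter(ch for ch in reference if not ch.isspace())
    hyp = Counter(ch for ch in hypothesis if not ch.isspace())
    overlap = sum((ref & hyp).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Two of four characters are shared, so precision = recall = 0.5.
print(char_overlap_f1("清风晓起", "清风徐来"))  # 0.5
```

Overlap metrics of this kind reward surface similarity only, which is one reason to pair them with rubric-based ratings rather than replace those ratings.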

7. Significance and Impact

CHC-Bench#1 offers a yardstick for quantifiable, cross-model comparison of advanced Chinese language ability in LLMs. It exposes weaknesses in areas that multilingual or English-centric evaluations tend to obscure, particularly cultural fidelity, STEM application, and linguistic sophistication. The explicit demonstration of CT-LLM#1’s strengths and gaps underscores the need for continued Chinese-specific tuning and hard-case data augmentation. A plausible implication is that benchmarks oriented towards real-world, high-stakes use cases can drive both architectural and data-centric advances in domain-specialized LLMs (Du et al., 2024).

By making the benchmark and evaluation procedure open, the developers facilitate ongoing, systematic progress tracking and model innovation for Chinese NLP research.


References (1)

Du et al. (2024). Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model.
