AncientBench: Benchmarking Ancient Chinese Texts

Updated 26 December 2025
  • AncientBench is a comprehensive benchmarking framework that evaluates models on multi-dimensional tasks including glyph, pronunciation, meaning, and contextual comprehension of ancient Chinese texts.
  • It leverages a diverse dataset of excavated sources like oracle bones, bronze inscriptions, bamboo slips, and silk manuscripts to rigorously test language understanding.
  • The framework distinguishes between excavated and transmitted corpora, offering detailed diagnostic analyses that inform advances in archaeological NLP and automated script decipherment.

AncientBench is a comprehensive benchmarking framework designed to evaluate the capabilities of LLMs and associated neural architectures on tasks central to the comprehension of excavated and transmitted ancient Chinese corpora. Addressing domains inadequately served by prior benchmarks—which are predominantly oriented toward modern or hand-transmitted classical texts—AncientBench foregrounds the unique linguistic, graphical, and interpretive challenges encountered in genuine archaeological data, notably those involving unearthed artifacts such as oracle bones, bronze inscriptions, bamboo slips, and silk manuscripts. By operationalizing multidimensional competencies across both textual and visual modalities, AncientBench enables systematic, fine-grained analysis of AI model performance for interdisciplinary applications in paleography, historical linguistics, and archaeology (Zhou et al., 19 Dec 2025).

1. Motivation and Scope

The central motivation for AncientBench stems from the absence of evaluation suites capable of probing LLMs’ performance on genuinely “excavated” ancient texts—defined as unmediated artifacts (e.g., oracle bones, bronzes, bamboo manuscripts) whose scripts predate standardization, lack digital encoding, and display idiosyncratic graphical, phonological, and semantic features. Existing Chinese NLP benchmarks (CLUE, CMMLU, WYWEB, AC-EVAL) focus on modern or “transmitted” corpora (annotated, hand-copied classical texts), omitting excavated sources that present distinctive paleographic and interpretive challenges. This omission hinders both the empirical study of script evolution and practical automation of specialist archaeological workflows (e.g., OCR post-correction, glyph classification, automated translation) (Zhou et al., 19 Dec 2025).

AncientBench is thus constructed to facilitate standardized, automated assessments of ancient character comprehension, promoting LLM development directly relevant to archaeological and historical research.

2. Competency Dimensions and Task Design

AncientBench operationalizes four distinct “comprehension” competencies, each capturing a different foundational cognitive skill required for reading and interpreting ancient Chinese texts:

  • Glyph Comprehension: Identification of graphical components (radicals) and normalization of archaic glyph forms to modern codepoints, addressing severe morphographical variation and eroded/exotic forms.
  • Pronunciation Comprehension: Recovery of reconstructed Old Chinese pronunciations and recognition of phonetic radicals, even in the absence of native audio data, using diachronic dictionaries and analogical reasoning.
  • Meaning Comprehension: Determination of dictionary-style senses in context, distinguishing meanings restricted to excavated corpora from those in later, transmitted corpora; highlighting semantic shift and rare lexical usages.
  • Contextual Comprehension: Integration of the above aspects to solve sentential tasks—blank filling (cloze), identification of phonetic loan characters, and translation between ancient and modern Chinese—requiring syntactic and semantic reasoning over low-frequency forms and ambiguous usages.

AncientBench comprises ten multiple-choice tasks (see Table 1), each with four candidate answers per prompt and mapping directly to these competencies; an illustrative item layout is sketched after the table:

Table 1. AncientBench tasks by competency.

| Competency | Task Name | Questions (n) |
| --- | --- | --- |
| Glyph | Radical Recognition | 8,438 |
| Glyph | Radical Meaning | 1,432 |
| Pronunciation | Pronunciation Recognition | 3,886 |
| Pronunciation | Phonetic Radical | 304 |
| Pronunciation | Homophone Identification | 2,265 |
| Meaning | Excavated Doc Word | 365 |
| Meaning | Transmitted Doc Word | 1,504 |
| Contextual | Cloze for Excavated Docs | 4,875 |
| Contextual | Phonetic Loan Character | 4,637 |
| Contextual | Translation | 1,001 |
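
To make the task format concrete, the snippet below sketches how a single four-option item might be laid out in Python; the field names and structure are hypothetical illustrations, not the released data schema.

```python
# Hypothetical layout of one AncientBench-style multiple-choice item.
# Field names are illustrative assumptions, not the official release format.
item = {
    "competency": "Meaning",
    "task": "Transmitted Doc Word",
    "question": "Which sense does the target character carry in this passage: ...?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "B",  # gold label, one of "A"-"D"
}
```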

This structure supports granular identification of strengths and bottlenecks for any LLM or hybrid system (Zhou et al., 19 Dec 2025).

3. Dataset Curation and Processing

The AncientBench dataset aggregates materials from both “excavated” and “transmitted” sources:

  • Excavated: Oracle bone inscriptions (Shang), bronze inscriptions (Zhou), Chu bamboo-slip manuscripts (Warring States), and silk manuscripts.
  • Transmitted: Canonical references such as Shuowen Jiezi (121 AD), Hanyu Da Cidian (1986), and pre-Qin texts including the Book of Poetry and Analects.

A three-stage digitization pipeline is employed:

  1. High-resolution image processing and computer vision techniques extract radicals and spatial relations, yielding a character knowledge graph.
  2. Unified-font encoding deduplicates Unicode entries and normalizes codepoints.
  3. Novel codepoints are assigned for glyphs absent from official Unicode, leveraging the Private Use Area (a minimal sketch follows below).
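
As a concrete illustration of step 3, the sketch below assigns stable Private Use Area codepoints (U+E000–U+F8FF in the Basic Multilingual Plane) to glyphs that lack official Unicode encoding. The sequential allocation scheme and the glyph identifiers are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: map unencoded glyph IDs to Private Use Area codepoints.
# The BMP Private Use Area spans U+E000..U+F8FF (6,400 codepoints); larger
# inventories would spill into the supplementary PUA in planes 15 and 16
# (U+F0000..U+FFFFD and U+100000..U+10FFFD).
PUA_START, PUA_END = 0xE000, 0xF8FF

def assign_pua_codepoints(glyph_ids):
    """Assign each unencoded glyph a fresh PUA codepoint, in input order."""
    mapping = {}
    next_cp = PUA_START
    for gid in glyph_ids:
        if next_cp > PUA_END:
            raise OverflowError("BMP PUA exhausted; use supplementary PUA planes")
        mapping[gid] = chr(next_cp)
        next_cp += 1
    return mapping

# Example with three hypothetical oracle-bone glyph identifiers.
table = assign_pua_codepoints(["OBI-0001", "OBI-0002", "OBI-0003"])
print({gid: f"U+{ord(ch):04X}" for gid, ch in table.items()})
# -> {'OBI-0001': 'U+E000', 'OBI-0002': 'U+E001', 'OBI-0003': 'U+E002'}
```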

The corpus spans approximately 3,000 BCE to 221 BCE and encompasses ≈100,000 unique tokens after deduplication. In total, AncientBench delivers 28,707 examples covering the ten tasks (Zhou et al., 19 Dec 2025).

4. Evaluation Protocol and Baselines

AncientBench uses strict multiple-choice accuracy as its evaluation metric:

$$
\mathrm{Accuracy} = \frac{\#\,\text{correct answers}}{\#\,\text{total questions}}
$$

Competency-level scores are defined as the mean accuracy over tasks in that layer; the overall AncientBench Score is the mean of the four competency-level scores.
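
The two-level scoring thus reduces to nested means over tasks and competencies. The following is a minimal sketch under the stated definitions; the task-to-competency mapping and toy numbers are illustrative only, not results from the paper.

```python
# Minimal sketch of AncientBench's two-level scoring:
# task accuracy -> mean per competency -> mean over competencies.
from statistics import mean

def accuracy(n_correct, n_total):
    return n_correct / n_total

def ancientbench_score(task_acc, task_to_competency):
    """task_acc: {task: accuracy}; task_to_competency: {task: competency}."""
    by_comp = {}
    for task, acc in task_acc.items():
        by_comp.setdefault(task_to_competency[task], []).append(acc)
    comp_scores = {c: mean(v) for c, v in by_comp.items()}
    return mean(comp_scores.values()), comp_scores

# Toy numbers for illustration:
task_acc = {"Radical Recognition": 0.50, "Radical Meaning": 0.40,
            "Pronunciation Recognition": 0.30}
t2c = {"Radical Recognition": "Glyph", "Radical Meaning": "Glyph",
       "Pronunciation Recognition": "Pronunciation"}
overall, per_comp = ancientbench_score(task_acc, t2c)
print(per_comp, overall)  # Glyph 0.45, Pronunciation 0.30 -> overall 0.375
```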

Evaluation uses both zero-shot and few-shot prompting:

  • Zero-shot: Plain instruction for answer selection.
  • Few-shot: Instruction plus five Q&A exemplars per prompt.

Answers are extracted deterministically: the model's next-token logits for the four option labels “A”, “B”, “C”, and “D” are compared, and the highest-scoring label is selected.
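
In effect, each question is scored with a single forward pass: the next-token logits for the four label tokens are compared and the argmax is taken. Below is a minimal sketch of this selection rule using the Hugging Face transformers API; the model name, prompt template, and single-token label assumption are placeholders rather than the paper's exact setup.

```python
# Minimal sketch: answer a multiple-choice prompt by comparing the model's
# next-token logits for the four option labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-14B-Chat"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model.eval()

def pick_option(prompt: str) -> str:
    """Return the label in {A, B, C, D} with the highest next-token logit."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits over the next token
    # Assumes each label encodes to a leading token; some tokenizers may
    # require a leading space, e.g. " A".
    label_ids = {c: tok(c, add_special_tokens=False).input_ids[0] for c in "ABCD"}
    return max(label_ids, key=lambda c: logits[label_ids[c]].item())

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
print(pick_option(prompt))
```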

Nine LLMs are compared, spanning generic models, Chinese-specific models, domain-finetuned models (e.g., Yi1.5-9B-Ancient, tuned on all 28,707 QA pairs), and relevant specialist models (e.g., Xunzi-Qwen-Chat, Tonggu-7B-Chat). Their performance is benchmarked against a human baseline of 10 archaeology/AI graduate students solving stratified multi-task samples (Zhou et al., 19 Dec 2025).

5. Results and Diagnostic Analysis

Key findings include:

  • Zero-shot Results: Human accuracy (55.13%) slightly exceeds the best LLM (Qwen-14B-Chat, 51.00%). Glyph comprehension (76.66% human vs. 40–53% LLM) and pronunciation comprehension (50% human vs. 25–35% LLM) are the critical gaps. Conversely, LLMs outperform humans on dictionary-context meaning tasks (e.g., 69.71% vs. 38.33% on Transmitted Doc Word). On contextual cloze/translation the results are closest, with models clustering around the human level (55.55% human, 48–62% LLM).
  • Few-shot Performance: Most models improve incrementally (+1–3%), with Yi1.5-9B-Ancient gaining +2.94% on glyph comprehension; however, some pretrained models degrade under few-shot prompts, indicating sensitivity to in-context exemplar sampling and few-shot overfitting.
  • Error Analysis: Single-modality text LLMs are bottlenecked on tasks requiring robust visual or phonological grounding (glyph shape recognition, phonetic radical inference). Fine-tuning on ancient-QA pairs improves specialization but may degrade broader semantic capacity (“catastrophic forgetting” across tasks).
  • Human-Like Reasoning: In meaning and contextual tasks that require integrating large-scale pretraining with domain-specific cues, some LLMs demonstrate near-human or superhuman classification, likely due to overfitting prevalent dictionary senses and frequency biases.

6. Significance, Extensions, and Related Benchmarks

AncientBench establishes the first benchmark standard for comprehensive, multi-dimensional evaluation of ancient Chinese language comprehension over both excavated and transmitted corpora. Its multidimensional task structure enables in-depth dissection of model capabilities for the archaeological, paleographic, and NLP communities.

Extension strategies include:

  • Coverage expansion to additional periods (e.g., Han through proto-Tang) and artifact genres (stele inscriptions, murals).
  • Integration of evaluation metrics more sensitive to generative variation (BLEU, F1, edit distance) for translation and open-ended tasks.
  • Development of adversarial/test-time-augmented data (e.g., glyphs with noise, missing strokes) and fine-grained difficulty labeling.
  • Linkage with OCR post-processing and glyph restoration systems (e.g., RZCR, CharFormer).
  • Exploration of multimodal and self-supervised pretraining pathways to support visual grounding and pronunciation induction without sacrificing language generalization.

A closely related paradigm is the Ancient Plant Seed (APS) + APSNet benchmark, which extends the AncientBench approach toward artifact-based, size- and shape-aware classification in archaeobotany, underscoring AncientBench's role as a blueprint for multidomain, empirically robust benchmarking of archaeological micro-artifacts (Xing et al., 20 Dec 2025). Compared with OBI-Bench (Chen et al., 2 Dec 2024), which targets multimodal perceptual and interpretive tasks on oracle bones, AncientBench is distinguished by its focus on textual, linguistic, and paleographic multi-competency evaluation.

7. Future Prospects and Scholarly Impact

AncientBench is positioned to drive further research in several domains:

  • Development of multimodal LLM architectures capable of ingesting complex visual inputs such as rubbings, bone fragments, and paleographic facsimiles.
  • Iterative improvement of LLMs for archaeological applications, supporting tasks such as automated translation, glyph restoration, and script decipherment, thus decreasing specialists’ manual effort.
  • Diagnostic capability for identifying granular deficiencies in model understanding of genuine ancient scripts, informing targeted algorithmic and pretraining interventions.
  • Facilitation of systematic, large-scale comparative studies in historical script evolution, semantic shift, and literacy in early Chinese civilization.

By providing an open, reproducible, and diagnostically rich benchmark suite, AncientBench enables the community to make measurable progress in both methodology and application for ancient language comprehension and the digital humanities (Zhou et al., 19 Dec 2025).


Primary sources: Zhou et al., 19 Dec 2025; Xing et al., 20 Dec 2025; Chen et al., 2 Dec 2024.
