ChangAn Benchmark: AI-Generated Chinese Poetry Detection

Updated 10 June 2026

ChangAn is a benchmark for distinguishing AI-generated from human-written classical Chinese poetry, built on curated datasets that obey strict metrical constraints.
It evaluates both decision-based and probability-based detectors in single-poem and multi-poem settings to reveal performance differences and limitations.
The benchmark reports standard classification metrics on fixed dataset partitions and takes the formal structure of classical poetry into account.

ChangAn is a large-scale benchmark developed for evaluating the detection of LLM-generated classical Chinese poetry. It addresses the critical challenge of distinguishing AI-generated verses from those written by contemporary human poets, considering the unique textual properties and formal constraints of classical Chinese poetry. The benchmark enables systematic assessment of both decision-based and probability-based detection methods, and exposes the limitations of current state-of-the-art detectors in this highly specialized literary domain (Li et al., 11 Apr 2026).

1. Dataset Construction and Composition

ChangAn comprises 30,664 classical Chinese poems, curated to represent both human and AI authorship. It contains 10,276 human-written poems by 282 modern poets and enthusiasts. The AI-generated subset consists of 20,388 poems, sourced in roughly equal shares from four prominent LLMs: Doubao Seed-1.6 (5,136 poems), DeepSeek-V3.2 (5,138), Kimi-K2 (5,017), and GPT-4.1 (5,097). Across genres, the corpus includes 17,193 Ci, 3,773 jueju, and 9,699 lvshi. It covers 1,802,646 characters and 307,760 sentences.
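As a quick sanity check on the composition figures, the per-model counts can be summed and converted into shares. This is only an illustrative sketch; the variable names are made up here and the numbers are copied from the paragraph above.

```python
# Counts copied from the text above; names are illustrative, not from the benchmark code.
HUMAN_POEMS = 10_276
AI_POEMS = {
    "Doubao Seed-1.6": 5_136,
    "DeepSeek-V3.2": 5_138,
    "Kimi-K2": 5_017,
    "GPT-4.1": 5_097,
}

ai_total = sum(AI_POEMS.values())
corpus_total = HUMAN_POEMS + ai_total
# Share of the AI subset contributed by each generator.
ai_shares = {name: count / ai_total for name, count in AI_POEMS.items()}

for name, share in ai_shares.items():
    print(f"{name}: {share:.1%} of AI poems")
print(f"AI total = {ai_total}, corpus total = {corpus_total}")
```

Each generator contributes close to a quarter of the AI subset, and the AI subset is about twice the size of the human subset, which matters when reading accuracy figures later on.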

Human poems were collected from social platforms (Xiaohongshu, Baidu Tieba) and literary publications, with explicit attribution of poet names or pseudonyms for transparency. Ancient canonical works are excluded to prevent data leakage, as such texts are often included in LLM pre-training. AI poems were produced via two modes: Direct Generation, where each LLM generated a poem given a seed title (drawn from 2,569 human poems), and Critique-Driven Refinement, wherein the LLM critiqued and subsequently revised its own draft. Generation prompts enforced strict adherence to classical metrical patterns, including prescribed rhyme, tonal arrangement, and parallelism. Outputs were produced in JSON format; post-processing removed invalid or duplicate outputs and verified metrical correctness. Human/AI labels are intrinsic to the data source; no further manual annotation was required beyond metrical verification (Li et al., 11 Apr 2026).
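The two generation modes and the post-processing step can be sketched roughly as follows. This is a minimal illustration under assumptions: `LLM` is a hypothetical stand-in for any function that sends a prompt to a model, the prompt wording and the `{"poem": ...}` JSON schema are invented for the example (the benchmark's actual templates are in its repository), and the metrical check is omitted.

```python
import json
from typing import Callable, Iterable, List, Optional

# Hypothetical type: any function mapping a prompt string to the model's raw reply.
LLM = Callable[[str], str]

def direct_generation(llm: LLM, title: str, form: str) -> str:
    """Mode 1: one-shot generation from a seed title (prompt wording is illustrative)."""
    prompt = (
        f"Write a classical Chinese {form} titled '{title}'. "
        "Follow the prescribed rhyme and tonal rules. "
        'Reply as JSON: {"poem": "..."}'
    )
    return llm(prompt)

def critique_driven_refinement(llm: LLM, title: str, form: str) -> str:
    """Mode 2: draft, self-critique, then revise the draft using the critique."""
    draft = direct_generation(llm, title, form)
    critique = llm(f"Critique this {form} for meter, rhyme, and imagery:\n{draft}")
    return llm(
        f"Revise the poem using the critique.\nPoem:\n{draft}\nCritique:\n{critique}\n"
        'Reply as JSON: {"poem": "..."}'
    )

def parse_poem(raw: str) -> Optional[str]:
    """Return the poem text, or None when the reply is not valid JSON of the assumed shape."""
    try:
        poem = json.loads(raw).get("poem")
    except (json.JSONDecodeError, AttributeError):
        return None
    if isinstance(poem, str) and poem.strip():
        return poem.strip()
    return None

def postprocess(raw_outputs: Iterable[str]) -> List[str]:
    """Drop invalid replies and exact duplicates, keeping first occurrences in order."""
    seen, kept = set(), []
    for raw in raw_outputs:
        poem = parse_poem(raw)
        if poem is None or poem in seen:
            continue
        seen.add(poem)
        kept.append(poem)
    return kept
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a real pipeline would add a metrical validator between `parse_poem` and the duplicate filter.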

2. Dataset Splits and Granularity

For supervised models such as RoBERTa and specialized AI detectors, ChangAn is split randomly into training, validation, and test sets with an 8:1:1 ratio, covering the full 30,664 poems. Zero-shot and decision-based detector evaluations use the subset of 2,569 human seed poems for prompt construction and a 28,095-poem testbed (7,707 remaining human poems and all 20,388 AI poems).
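A random 8:1:1 split of this size can be sketched as below. The seed and the use of integer division for the boundaries are assumptions made for the example; the text does not say whether the split was stratified by label or genre.

```python
import random

def split_8_1_1(items, seed=0):
    """Shuffle and cut into train/validation/test at 80 % / 10 % / remainder."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed is an illustrative choice
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_8_1_1(range(30_664))
print(len(train), len(val), len(test))  # 24531 3066 3067

# Zero-shot testbed size, using the counts quoted above.
remaining_human = 10_276 - 2_569
testbed = remaining_human + 20_388
print(remaining_human, testbed)  # 7707 28095
```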

Detection tasks are structured at multiple text granularities. Single-Poem Detection (SPD) classifies one poem at a time. Multi-Poem Detection (MPD) evaluates aggregated batches of 6 (MPD-6) or 12 (MPD-12) poems, simulating the classification of an author’s mini-anthology and leveraging stylistic consistency across a set. Results are consistently reported at all three granularities, revealing distinct performance effects (Li et al., 11 Apr 2026).
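One simple way to picture multi-poem detection with a score-producing detector is to group poems from a single source into batches of 6 or 12 and pool the per-poem scores. The sketch below uses a plain mean as the pooling rule; that rule, and the choice to drop a trailing partial batch, are assumptions for illustration, since the text does not specify how batches are scored.

```python
from statistics import mean

def make_batches(poems, k):
    """Group poems from one source into consecutive batches of k; a trailing partial batch is dropped."""
    return [poems[i:i + k] for i in range(0, len(poems) - k + 1, k)]

def batch_score(per_poem_scores):
    """Pool per-poem 'AI-likeness' scores into one batch-level score (mean is an assumed rule)."""
    return mean(per_poem_scores)

# Toy example: 13 per-poem scores from one hypothetical author.
scores = [0.4, 0.6, 0.7, 0.3, 0.8, 0.9, 0.5, 0.6, 0.7, 0.6, 0.8, 0.4, 0.9]
mpd6 = [batch_score(b) for b in make_batches(scores, 6)]
mpd12 = [batch_score(b) for b in make_batches(scores, 12)]
print(len(mpd6), len(mpd12))  # 2 1
```

Pooling reduces the variance of individual poem scores, which is one intuition for why batch-level decisions can be easier than single-poem ones.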

3. Evaluation Metrics and Experimental Protocols

ChangAn uses standard classification metrics:

Precision ( $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$ ), recall ( $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$ ), F1 ( $F_1 = 2PR / (P + R)$ ), and accuracy ( $A = (\mathrm{TP} + \mathrm{TN}) / (\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN})$ ) are computed from true/false positive/negative counts, with “AI” as the positive class.
Macro-recall and macro-F1 average class-wise recall and F1, respectively.
Probability-based detectors (e.g., those using log-likelihoods) are additionally evaluated by Area Under the Receiver Operating Characteristic Curve (AUROC).
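The metrics listed above can be computed in a few lines of plain Python. This is a sketch for illustration, not the benchmark's evaluation script; the label strings `"ai"` and `"human"` and the function names are assumptions. AUROC is computed with the pairwise-comparison (Mann-Whitney) formulation, counting ties as half a win.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_scores(y_true, y_pred, classes=("ai", "human")):
    """Unweighted mean of class-wise recall and class-wise F1."""
    per_class = [prf(y_true, y_pred, c) for c in classes]
    macro_recall = sum(r for _, r, _ in per_class) / len(classes)
    macro_f1 = sum(f for _, _, f in per_class) / len(classes)
    return macro_recall, macro_f1

def auroc(y_true, scores, positive="ai"):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: a detector that labels most AI poems as human.
y_true = ["ai", "ai", "ai", "human"]
y_pred = ["ai", "human", "human", "human"]
print(accuracy(y_true, y_pred), macro_scores(y_true, y_pred))
print(auroc(y_true, [0.9, 0.4, 0.6, 0.5]))
```

The toy example mirrors the failure pattern reported for decision-based detectors: recall on the AI class is low while recall on the human class is high, so macro-averaged scores are more informative than accuracy alone.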

Decision-based detectors (LLMs functioning as black-box classifiers) report accuracy, recall_AI, recall_human, and macro-F1, while probability-based methods report AUROC and macro-F1. For fine-tuned Chinese RoBERTa experiments, training uses three epochs, a learning rate of 1e-4, and batch size 16. Experiments are repeated three times on A100 GPUs and results averaged to ensure statistical consistency (Li et al., 11 Apr 2026).

4. Baseline Detector Types and Empirical Results

Twelve baseline detectors are systematically evaluated:

Decision-based (LLM) Detectors: DeepSeek-V3.2, Kimi-K2, Doubao Seed-1.6, GPT-4.1, and LLM-Detector-Small-zh.
Probability-based Detectors: Log-Likelihood (Qwen2.5-3B), Log-Rank (GLTR style), LRR (Log-Rank Ratio), Fast-DetectGPT, RoBERTa (supervised), AIGC-Detector-v3 (Zh-v3), AIGC-Detector-v3-short.
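To make the zero-shot statistics concrete, the sketch below computes average log-likelihood, average log-rank, and a log-likelihood to log-rank ratio from per-position next-token distributions. It is an assumption-laden illustration: the distributions are toy dictionaries rather than output from Qwen2.5-3B, the function names are invented, and the ratio follows the common definition from the zero-shot detection literature (negative summed log-likelihood divided by summed log-rank), which may differ in detail from the benchmark's implementation.

```python
import math

def token_stats(dists, tokens):
    """dists[i] maps each candidate token to the scoring model's probability at position i."""
    log_probs, log_ranks = [], []
    for dist, tok in zip(dists, tokens):
        p = dist[tok]
        rank = 1 + sum(q > p for q in dist.values())  # rank 1 = most likely token
        log_probs.append(math.log(p))
        log_ranks.append(math.log(rank))
    return log_probs, log_ranks

def detector_scores(dists, tokens):
    log_probs, log_ranks = token_stats(dists, tokens)
    n = len(tokens)
    total_log_rank = sum(log_ranks)
    return {
        "log_likelihood": sum(log_probs) / n,  # higher = more predictable text
        "log_rank": total_log_rank / n,        # lower = observed tokens rank near the top
        # Ratio is undefined when every token is rank 1; report infinity in that case.
        "lrr": -sum(log_probs) / total_log_rank if total_log_rank > 0 else float("inf"),
    }

# Toy two-token "poem" scored against made-up distributions.
dists = [
    {"月": 0.6, "风": 0.3, "山": 0.1},
    {"月": 0.2, "风": 0.7, "山": 0.1},
]
scores = detector_scores(dists, ["月", "月"])
print(scores)
```

Under the usual premise of these methods, machine-generated text tends to be more predictable to a language model than human text, so a threshold on any of the three statistics yields a detector.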

Summary of key performance results:

Detector Type	Representative Method	Metric / Value
Decision-based (LLM)	DeepSeek-V3.2	Acc. ≈ 39.6 %, Macro-F1 ≈ 39.4 %
Decision-based (LLM)	GPT-4.1	Acc. ≈ 26 %, Macro-F1 ≈ 23 %
Probability-based	Log-Likelihood	AUROC ≈ 83.9 %, Macro-F1 ≈ 73.8 %
Probability-based	Log-Rank	AUROC ≈ 85.9 %, Macro-F1 ≈ 75.6 %
Probability-based	RoBERTa (fine-tuned)	AUROC ≈ 95.0 %, Macro-F1 ≈ 86.2 %
Probability-based	AIGC-Detector-v3	AUROC ≈ 72.4 %
Probability-based	Fast-DetectGPT	AUROC ≈ 49.7 % (fails on poetry)

Decision-based LLMs perform poorly: average accuracy is ≈ 32 % and recall_AI ≈ 16 %, below the 50 % that uniform random guessing would be expected to reach on a two-class task. DeepSeek-V3.2 performs best among the LLMs; GPT-4.1 is the weakest. Probability-based detectors achieve better results, led by the RoBERTa model fine-tuned on ChangAn (AUROC ≈ 95 %); zero-shot statistics such as Log-Likelihood and Log-Rank also substantially outperform decision-based methods. Fast-DetectGPT, which relies on conditional probability curvature, performs at chance level on classical poetry (AUROC ≈ 49.7 %).

LLM self-recognition (a model detecting its own generations) is ineffective. For instance, Doubao Seed-1.6 reaches a recall of only ≈ 16.1 % on its own outputs versus ≈ 37.3 % on GPT-4.1 outputs, and GPT-4.1's recall stays below 7 % for any author. Multi-poem aggregation boosts performance: average decision-based accuracy rises from ≈ 32 % (SPD) to ≈ 56 % (MPD-6/12), and probability-based AUROC improves from ≈ 75.7 % (SPD) to ≈ 88 % (MPD-6/12), though the incremental benefit diminishes for larger batches. Critique-driven refinement makes detection harder, lowering AUROC by ≈ 9.8 percentage points (from ≈ 80.9 % to ≈ 71.1 %) and slightly reducing decision-based recall_AI (Li et al., 11 Apr 2026).

5. Analysis of Detection Challenges and Failure Modes

The limitations of automated detection are attributed to several domain-specific features:

Rigid meter, rhyme, and parallelism eliminate the stylistic “noise” detectors often leverage in unconstrained prose.
Both LLMs and humans employ a deeply shared imagery lexicon (e.g., moon, river, wind, distance), eroding lexical distinctiveness.
Flexible classical Chinese syntax and archaic grammar decrease the density and utility of discriminative n-gram statistics.
Critique-driven refinement generates probability distributions closer to those produced by humans, masking the telltale artifacts exposed by most detectors.

A plausible implication is that models tailored to prose or modern text genres are fundamentally mismatched when applied to highly constrained literary forms such as classical poetry (Li et al., 11 Apr 2026).

6. Directions for Improvement and Resource Availability

Recommendations for advancing detection include explicit modeling of metrical and prosodic features (e.g., tone-pattern encodings), construction of high-order imagery networks (modeling the co-occurrence of classical allusions), adversarial training using critique-refined prompts, and development of multi-granularity ensembles integrating character-, line-, and poem-level representations. Exploration of hybrid models combining decision-based reasoning with statistical anomaly detection is also advised.
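As one illustration of what a tone-pattern encoding might look like, the sketch below maps each character of a line to a level ("P", ping) or oblique ("Z", ze) tone class and compares the result to a template. Everything here is hypothetical: the five-entry lookup table is a toy stand-in for a full rhyme-book dictionary, the templates are invented for the example, and the function names are not from the benchmark code.

```python
# Toy lookup for illustration only: "P" = level (ping) tone, "Z" = oblique (ze) tone.
# A real system would need a complete dictionary and handling of multi-reading characters.
TONE_CLASS = {"床": "P", "前": "P", "明": "P", "月": "Z", "光": "P"}

def tone_pattern(line, unknown="?"):
    """Encode a line as a string of tone classes, one symbol per character."""
    return "".join(TONE_CLASS.get(ch, unknown) for ch in line)

def pattern_mismatches(line, template):
    """Count positions where the line deviates from a template; '*' marks a free position.

    Characters missing from the lookup are encoded as '?' and count as mismatches
    at any constrained position.
    """
    pattern = tone_pattern(line)
    if len(pattern) != len(template):
        raise ValueError("line and template must have the same length")
    return sum(t != "*" and p != t for p, t in zip(pattern, template))

print(tone_pattern("床前明月光"))                 # PPPZP
print(pattern_mismatches("床前明月光", "**PZP"))  # 0 (hypothetical template)
```

Features of this kind (the pattern string itself, or the mismatch count against each admissible template) could be fed to a classifier alongside token-level statistics, giving the detector direct access to the metrical layer that generic prose detectors ignore.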

All ChangAn resources—including dataset, code, prompt templates, and evaluation scripts—are publicly released under a CC-BY-NC license at https://github.com/VelikayaScarlet/ChangAn, facilitating reproducibility and further methodological research in the domain of generative literary text detection (Li et al., 11 Apr 2026).

References (1)

Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry (2026)
