Single-Poem Detection (SPD)

Updated 10 June 2026

Single-Poem Detection (SPD) is the task of automatically authenticating a single poem as human-written or LLM-generated using statistical cues, stylometric artifacts, and semantic signals.
SPD research relies on curated benchmark datasets such as AIGenPoetry and ChangAn, which pair human-written and AI-generated poem corpora for in-domain and out-of-domain evaluation.
Algorithmic approaches range from probabilistic methods and supervised neural classifiers to multimodal detectors that combine text and image data for higher detection accuracy.

Single-Poem Detection (SPD) is the task of automatically determining, from a single poem, whether it was authored by a human or generated by an LLM. Unlike multi-document detection, SPD operates at the extreme of input sparsity, where statistical cues, stylometric artifacts, and deep semantic signals are all highly unstable. The proliferation of LLM-generated poems, particularly in Chinese literary domains, has made SPD more urgent for authenticating poetic works and preserving creative ecosystems (Wang et al., 1 Sep 2025, Li et al., 11 Apr 2026, Wang et al., 21 May 2026).

1. Benchmark Datasets for SPD

SPD research depends on carefully curated datasets that cleanly separate human-written from AI-generated poetry. The two established benchmarks are:

AIGenPoetry (Modern Chinese Poetry): 800 poems by six professional poets supply the human corpus, and 41,600 LLM-generated counterparts (OpenAI GPT-4.1, GLM-4, DeepSeek-V3, DeepSeek-R1) are produced via 13 distinct prompts per human original, spanning stylistic imitation, stanza/line control, and explicit emotion targeting (Wang et al., 1 Sep 2025, Wang et al., 21 May 2026).
ChangAn (Classical Chinese Poetry): 10,276 human poems by 282 poets, plus 20,388 LLM-generated (from DeepSeek-V3.2, GPT-4.1, Kimi-K2, Doubao Seed-1.6). Poems span Ci, jueju, and lüshi forms. Prompting includes direct generation and critique-driven refinement to enforce metrical and stylistic constraints (Li et al., 11 Apr 2026).

Key dataset properties:

Benchmark	Human Poems	AI Poems	Forms/Styles	Noteworthy Controls
AIGenPoetry	800	41,600	Modern, free-form	Provenance, prompt stratification
ChangAn	10,276	20,388	Ci, jueju, lüshi	Critique-refinement, form diversity

Quality is assured through rigorous provenance checks on human texts and through spot-checking or prompt validation of LLM outputs. Data splits are stratified to support both in-domain and out-of-domain (generalization) experiments.
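The in-domain versus out-of-domain split described above can be sketched as follows. This is a minimal illustration, not the benchmarks' actual procedure: the record keys (`text`, `label`, `generator`), the function name `split_by_generator`, and the 20% test fraction are all assumptions made for the example. Poems from held-out generators are reserved for the out-of-domain test set, and the remaining poems are split per (label, generator) stratum.

```python
import random
from collections import defaultdict


def split_by_generator(records, held_out_generators, test_fraction=0.2, seed=0):
    """Split poem records into train, in-domain test, and out-of-domain test.

    Each record is a dict with hypothetical keys:
      "text", "label" ("human" or "ai"), "generator" (model name, None for human).
    """
    rng = random.Random(seed)
    held_out = set(held_out_generators)
    strata = defaultdict(list)
    ood_test = []

    for rec in records:
        if rec["label"] == "ai" and rec["generator"] in held_out:
            # AI poems from unseen generators never reach training.
            ood_test.append(rec)
        else:
            strata[(rec["label"], rec["generator"])].append(rec)

    train, id_test = [], []
    # Split each stratum separately so class and generator ratios are preserved.
    for key in sorted(strata, key=str):
        group = list(strata[key])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        id_test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, id_test, ood_test
```

In this sketch the out-of-domain set contains only AI poems; for a complete out-of-domain evaluation it would be paired with human poems that were also excluded from training, for example those in the in-domain test set.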

2. Algorithmic Approaches to SPD

SPD methods fall into three categories: probabilistic/statistical detectors, supervised classifiers, and multimodal (image-semantic) models.

Probability/Rank-based Detectors: Fast-DetectGPT uses conditional probability curvature to differentiate AI outputs. Log-Likelihood and Log-Rank (GLTR) exploit tokenwise model scores and prediction rank distributions. LRR (Log-Likelihood Log-Rank Ratio) takes the ratio of a text's log-likelihood to its log-rank under the scoring model. These methods offer only modest discriminative power, particularly where LLMs closely mimic human syntactic or statistical distributions (Wang et al., 1 Sep 2025, Li et al., 11 Apr 2026).
Supervised Neural Classifiers: RoBERTa-based models (Chinese-RoBERTa-wwm-ext, Roberta-ZH) are fine-tuned on large balanced poem datasets. These capture both low-level lexical cues and high-level stylistic/semantic nuances. Such models achieve markedly superior Macro-F1 and AUROC compared to zero-shot statistical approaches (Wang et al., 1 Sep 2025, Li et al., 11 Apr 2026, Wang et al., 21 May 2026).
LLM-based & Multimodal Detectors: Multimodal large language models (MLLMs), as in the IMAGINE pipeline, jointly process the poem text and a semantically aligned image representing the poem’s core imagery. Cross-modal attention allows detectors such as Gemini-3 to achieve state-of-the-art accuracy, exploiting the “seeing→feeling→writing” correspondences found in the Chinese poetic tradition (Wang et al., 21 May 2026).
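The log-likelihood and log-rank scores used by the probability/rank-based detectors above can be sketched as follows. This is a simplified illustration in plain Python: the inputs (`step_probs`, a per-position probability distribution over a toy vocabulary, and `token_ids`, the observed tokens) are assumed to have been obtained from some scoring language model, and the function name `token_stats` is hypothetical.

```python
import math


def token_stats(step_probs, token_ids):
    """Return (mean log-likelihood, mean log-rank) of the observed tokens.

    step_probs[t] is the scoring model's probability distribution over the
    vocabulary at position t; token_ids[t] is the token actually observed.
    """
    log_probs, log_ranks = [], []
    for probs, tok in zip(step_probs, token_ids):
        p = probs[tok]
        # Rank 1 means the observed token was the model's top prediction.
        rank = 1 + sum(1 for q in probs if q > p)
        log_probs.append(math.log(p))
        log_ranks.append(math.log(rank))
    n = len(log_probs)
    return sum(log_probs) / n, sum(log_ranks) / n


# Toy example: vocabulary of three tokens, two positions.
mean_ll, mean_lr = token_stats([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]], [0, 2])
```

Text that the scoring model finds highly predictable (high mean log-likelihood, low mean log-rank) is typically flagged as more likely machine-generated; a decision threshold would have to be calibrated on held-out poems.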

Method Family	Example Detectors	Macro-F1 (Modern)	Macro-F1 (Classical)	Key Properties
Probability/Rank	Log-Likelihood, GLTR	~67-68%	~75%	Unsupervised, zero-shot
Supervised Classifier	RoBERTa, Roberta-ZH	~91%	~86%	Needs training, high flexibility
Multimodal	Gemini+IMAGINE	85.65%	—	Leverages text–image alignment

A plausible implication is that image-semantic integration compensates for stylistic ambiguity at the text level by revealing thematic and emotional congruence.

3. Empirical Results and Diagnostic Analysis

Detecting a single poem is notably more difficult than detecting an aggregate of several poems. In modern Chinese poetry (AIGenPoetry baseline), classic detectors (Fast-DetectGPT, Log-Likelihood, Log-Rank) converge around 70% AUC-ROC, with F1 in the high 60s. Binoculars shows similar performance. RoBERTa surpasses these with AUC-ROC 91.4% and F1 91.2% (Wang et al., 1 Sep 2025). In classical poetry (ChangAn), Roberta-ZH achieves Macro-F1 86.18% and AUROC 95.03%; statistical methods degrade under critique-driven refinement, with AUROC dropping by 17 points (Li et al., 11 Apr 2026).

Style imitation, especially with powerful models like GPT-4.1, causes detectors’ F1 to drop below 50% on stylistic challenge prompts. Explicit-emotion prompts are easier: when emotion words appear overtly, neural classifiers approach perfect separation.

Incorporating image semantics (IMAGINE with Gemini) yields a Macro-F1 of 85.65%, reported as more than one percentage point above text-only RoBERTa and more than 35 pp above traditional zero-shot methods within that study's evaluation. Ablation shows that prompt examples and image features each contribute a nontrivial additive benefit (Wang et al., 21 May 2026).

4. Fundamental Challenges in SPD

SPD faces inherent barriers arising from data size, poetic heterogeneity, and LLM capabilities:

Short Text Length: Poems typically span 8–20 lines (modern) or ~10 lines of 5–7 characters each (classical), offering few tokens for statistical anomaly detection or stylometric aggregation (Wang et al., 1 Sep 2025, Li et al., 11 Apr 2026).
Style Variability: Modern forms emphasize syntactic violation and idiosyncratic punctuation; LLMs have learned to mimic these, undermining cues used in prose detection.
Semantic Depth: Human poets embed multilayered metaphor and world knowledge; LLMs can reproduce surface-level imagery but may lack global thematic coherence—though this is challenging to quantify in single instances.
Cross-model Generalization: Fine-tuned detectors often fail catastrophically when confronted with poems from novel LLMs or generation pipelines, revealing that token-level statistical biases vary considerably between models.
Poetic Homogeneity: Shared vocabulary and imagery across the tradition blur human–AI distinctions, especially under strict metric/formal constraints in classical genres (Li et al., 11 Apr 2026).

Decision-based SPD, in which an LLM is prompted to judge a poem directly, performs at or below random chance when confronted with well-controlled adversarial or critique-refined AI poetry (Li et al., 11 Apr 2026).

5. Metrics and Evaluation Paradigms

SPD task evaluation employs the following metrics:

Accuracy: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$
Recall (TPR): $\mathrm{Recall} = \frac{TP}{TP + FN}$
F1-score: $F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
AUC-ROC: Area under Receiver Operating Characteristic curve as threshold varies
Macro-F1: Unweighted mean of F1 for human and AI classes
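The metrics listed above can be computed directly from labels and predictions. The following is a dependency-free sketch; the function names and the label strings `"human"` and `"ai"` are illustrative assumptions. AUC-ROC is computed here through its pairwise-ranking interpretation (the probability that a randomly chosen AI poem receives a higher score than a randomly chosen human poem, with ties counted as one half).

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, classes=("human", "ai")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)


def auroc(y_true, scores, positive="ai"):
    """Pairwise AUC-ROC; higher scores are assumed to mean 'more likely AI'."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC-ROC is quadratic in the number of poems, which is acceptable for a sketch; an established library implementation would be preferable at benchmark scale.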

Granularity significantly affects detection rates. Moving from single-poem detection (SPD) to multi-poem aggregation (MPD-6 or MPD-12) raises Macro-F1 by 10–20 points; most of the achievable gain is reached by MPD-6, with diminishing returns for larger batches (Li et al., 11 Apr 2026).
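One simple way to picture the move from SPD to MPD-6 or MPD-12 is to average per-poem detector scores over fixed-size groups of poems from the same source. The sketch below assumes mean aggregation, which is an illustrative choice and not necessarily the aggregation used in the cited work; `aggregate_scores` is a hypothetical helper.

```python
def aggregate_scores(scores, group_size):
    """Average per-poem scores over consecutive groups of `group_size` poems.

    `scores` must all come from the same source (one poet or one LLM).
    Incomplete trailing groups are dropped so every group has equal size.
    """
    last_start = len(scores) - group_size
    groups = [scores[i:i + group_size] for i in range(0, last_start + 1, group_size)]
    return [sum(g) / len(g) for g in groups]


# SPD corresponds to group_size=1; MPD-6 and MPD-12 to group sizes 6 and 12.
per_poem = [0.2, 0.4, 0.6, 0.8, 1.0, 0.0, 0.5]
mpd6 = aggregate_scores(per_poem, 6)
```

Averaging reduces the variance of noisy per-poem scores, which is consistent with the reported gains and with the diminishing returns beyond small group sizes.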

6. Current Best Practices and Research Directions

Effective SPD strategies combine multiple detector modalities and leverage both supervised and unsupervised cues, as well as multimodal augmentation (Wang et al., 1 Sep 2025, Li et al., 11 Apr 2026, Wang et al., 21 May 2026):

Supervised Classifiers: Fine-tuned RoBERTa-type transformers anchor current state-of-the-art under text-only constraints.
Zero-shot Statistical Methods: Log-Rank, Log-Likelihood provide auxiliary signal, especially when ensemble-averaged.
Multimodal Approaches: Semantic-guided detectors (e.g., IMAGINE based on Gemini) integrate images generated from the poem, amplifying detection of emotional and thematic congruity.
Stylometric Extensions: Future detectors may benefit from explicit features such as classical allusions, idiom ratios, rhyme/tonal pattern compliance, and stanzaic coherence metrics.
Contrastive and Adversarial Training: Style-cloned and critique-refined AI outputs should expand training data. Synthetic perturbation and back-translation help decouple superficial fluency from deeper stylistic elements.
Higher-Order Coherence and Latent Representation Learning: Embeddings sensitive to rhetorical unity (SimCSE), line-parallelism, or latent stylistic “voice” may strengthen discriminability.
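As a small illustration of the explicit poetic features mentioned above, the sketch below extracts line-structure features from a classical poem and checks whether it matches the regular shape of a jueju (four lines of five or seven characters) or a lüshi (eight such lines). It is a much simpler stand-in for rhyme or tonal-pattern compliance, which would require a pronunciation dictionary; the function name `form_features` and the feature keys are hypothetical.

```python
import re


def form_features(poem):
    """Extract simple line-structure features from a classical Chinese poem."""
    # Split on common Chinese punctuation and whitespace to recover lines.
    lines = [s for s in re.split(r"[，。！？；、\s]+", poem) if s]
    lengths = [len(s) for s in lines]
    uniform = len(set(lengths)) == 1
    regular = uniform and lengths[0] in (5, 7)
    if regular and len(lines) == 4:
        form = "jueju"
    elif regular and len(lines) == 8:
        form = "lüshi"
    else:
        form = "other"  # includes Ci, whose line lengths vary by tune pattern
    return {"n_lines": len(lines), "line_lengths": lengths,
            "uniform": uniform, "form": form}


features = form_features("床前明月光，疑是地上霜。举头望明月，低头思故乡。")
```

Such features could be concatenated with classifier embeddings or used as inputs to a lightweight auxiliary model; whether they improve SPD is an open question rather than an established result.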

Recommendations for benchmarking and future work include adversarial re-training, modeling explicit poetic features, ensemble methods that combine zero-shot and neural cues, and evaluation strategies that test for poet- or LLM-homogeneity clustering (style homogeneity tests).
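The ensemble idea of combining zero-shot and neural cues can be sketched as a normalized score average. This is one possible design under stated assumptions: each detector's scores are z-normalized so that detectors on different scales contribute comparably, and every detector is oriented so that higher means "more likely AI". The function names and the optional `weights` parameter are illustrative.

```python
import statistics


def zscore(values):
    """Standardize scores to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def ensemble(score_lists, weights=None):
    """Weighted average of z-normalized detector scores, one value per poem.

    score_lists[d][i] is detector d's score for poem i; all detectors must be
    oriented so that a higher score means 'more likely AI-generated'.
    """
    if weights is None:
        weights = [1.0] * len(score_lists)
    normalized = [zscore(scores) for scores in score_lists]
    total = sum(weights)
    n_poems = len(score_lists[0])
    return [
        sum(w * z[i] for w, z in zip(weights, normalized)) / total
        for i in range(n_poems)
    ]


# Two detectors on different scales scoring the same three poems.
combined = ensemble([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```

Scores such as mean log-rank, where lower values suggest machine text, would need to be negated before being passed in. In practice the normalization statistics and weights would be fit on a validation set rather than on the poems being scored.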

7. Limitations, Open Problems, and Future Prospects

While ensemble and multimodal detectors have advanced SPD for both modern and classical Chinese poetry, several limitations persist:

Statistical stylometry is fundamentally constrained by single-poem brevity.
Deep stylistic imitation by advanced LLMs (e.g., in metaphor, punctuation) can foil both neural and classical detectors.
Domain adaptation across LLMs or to other genres (essay, news, lyrics) is nontrivial and remains unproven.
Performance depends on reliable image generation for vision-language fusion; noisy or unaligned images degrade detection.
SPD metrics remain an imperfect proxy for literary creativity or world-knowledge depth; human-level “intuition” in poetic recognition may continue to outpace algorithmic approaches.

Ongoing work targets improved integration of structured poetic feature extraction, self-supervised or contrastive learning for robust style encoding, benchmarking unseen LLMs, and extending to multilingual SPD. Advances in these directions may close the persistent gap observed in style imitation and intrinsic poetic voice detection (Wang et al., 1 Sep 2025, Li et al., 11 Apr 2026, Wang et al., 21 May 2026).