- The paper introduces a comprehensive framework, POEMetric, which rigorously measures poetic dimensions including form, creativity, and emotional resonance.
- It employs both rule-based classifiers and expert/LLM evaluators to assess 10 dimensions, revealing LLM strengths in stylistic form but deficiencies in creative expression.
- Results indicate LLMs match human accuracy in structured elements but fall short in expressive depth, highlighting the need for improved machine poetic artistry.
POEMetric: A Comprehensive Framework for Poetry Evaluation in the Era of LLMs
Introduction
"POEMetric: The Last Stanza of Humanity" (2604.03695) proposes a rigorous and multifaceted evaluation framework for the assessment of poetry generated by LLMs and compares their performance to human poets. Prior research in automated poetry generation has largely centered on formal properties, such as rhyme and meter, as well as general fluency and coherence. However, these dimensions are insufficient for capturing the advanced creative abilities that define high-quality poetry, namely creativity, idiosyncrasy, emotional resonance, and the use of complex literary techniques. This study systematically addresses this gap by introducing POEMetric, conducting large-scale comparative studies across 203 human poems and 6,090 LLM-generated poems from 30 diverse models, and employing both rule-based and LLM/human-expert evaluators.
Framework and Methodology
POEMetric formalizes poetry evaluation into ten dimensions grouped into three overarching categories:
- Basic instruction-following abilities: form (meter, rhyme) accuracy and theme alignment.
- Advanced creative abilities: creativity, lexical diversity, idiosyncrasy, emotional resonance, use of imagery, and literary devices.
- General appraisal: overall poem quality and authorship estimation.
The framework leverages a curated fixed-form human poem dataset spanning seven canonical genres, annotated with metadata for meter, rhyme, theme, and imagery. Every LLM receives prompts matching human poem forms and themes, ensuring controlled comparisons. For form and thematic dimensions, the authors implement a rule-based classifier using a hybrid of syntactic and phonetic analysis. For more nuanced qualities, both an LLM-as-a-judge (Gemini-2.5-Pro, selected based on inter-rater reliability with humans) and a panel of human poetry experts are used, each employing a detailed Likert-scale survey reflecting the POEMetric metrics.
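To make the idea of a rule-based form check concrete, here is a minimal sketch of rhyme-scheme matching. It is not the authors' classifier: the summary describes a hybrid of syntactic and phonetic analysis, whereas this sketch substitutes a crude orthographic proxy (the last two letters of each line's final word). The names `rhyme_key`, `rhyme_scheme`, and `scheme_matches` are hypothetical.

```python
import re


def rhyme_key(line: str, tail: int = 2) -> str:
    """Crude rhyme key (assumption): last `tail` letters of the final word.

    A real system would compare phonemes, not spelling.
    """
    words = re.findall(r"[a-z']+", line.lower())
    if not words:
        return ""
    return words[-1].replace("'", "")[-tail:]


def rhyme_scheme(lines: list[str]) -> str:
    """Assign A, B, C, ... so that lines sharing a rhyme key share a letter."""
    labels: dict[str, str] = {}
    scheme = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines such as stanza breaks
        key = rhyme_key(line)
        if key not in labels:
            labels[key] = chr(ord("A") + len(labels))
        scheme.append(labels[key])
    return "".join(scheme)


def scheme_matches(lines: list[str], target: str) -> bool:
    """Check whether the detected scheme equals the prescribed one."""
    return rhyme_scheme(lines) == target.replace(" ", "")


quatrain = [
    "The sun went down upon the bay",
    "The stars came out at night",
    "The children put their toys away",
    "And shut the window tight",
]
print(rhyme_scheme(quatrain))
```

Spelling-based keys misfire on pairs such as "love" and "move", which is one reason a phonetic lexicon is the more plausible basis for the classifier described above.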
Results
All top-tier LLMs, especially Gemini-2.5-Pro and DeepSeek-R1, demonstrate proficiency in form (Gemini-2.5-Pro scores 4.26/5) and theme (4.99/5) conformity. Rule-based evaluation confirms high accuracy in following prescribed rhyme and meter, with leading open- and closed-source models matching or exceeding humans in these constrained aspects. This result reflects the efficacy of current LLMs in syntax-, meter-, and rule-driven generation tasks.
Advanced Creative Abilities
A systematic deficit persists for LLMs across all qualitative advanced creative metrics:
- Creativity: Human poems achieve a mean score of 4.02, while the best LLMs (DeepSeek-R1, Gemini-2.5-Pro, Claude-3.7-Sonnet) are consistently below 3.5.
- Idiosyncrasy: Humans (3.95) surpass LLMs by a margin exceeding 1 point, revealing a generalized inability for LLMs to manifest unique authorial signatures.
- Emotional Resonance: Human poetry scores 4.06, with LLMs trailing significantly, and most failing to evoke genuine affect.
- Imagery/Literary Devices: Humans are rated markedly higher for employing vivid, original imagery (4.49) and sophisticated techniques (4.67), despite LLMs' mechanical insertion of literary forms.
Interestingly, only in lexical diversity (as measured by MATTR) do LLMs approach or marginally surpass humans, suggesting a tendency to overuse rare or varied words without deep semantic integration.
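For reference, MATTR (moving-average type-token ratio) averages the type-token ratio over every sliding window of a fixed length, which makes it less sensitive to text length than a plain type-token ratio. The sketch below follows that standard definition; the window size of 50, the regex tokenizer, and the fallback for texts shorter than the window are assumptions, since the summary does not state the settings used in the paper.

```python
import re


def mattr(text: str, window: int = 50) -> float:
    """Moving-average type-token ratio over sliding windows of `window` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Assumption: fall back to the plain type-token ratio for short texts.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


print(mattr("the cat the dog", window=10))
```

A high MATTR only says that many distinct word forms appear per window. It says nothing about whether those words are semantically well integrated, which is consistent with the gap between lexical diversity and the other creative dimensions.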
Overall Appraisal and Authorship Attribution
In terms of overall poem quality, human poets lead (4.22 vs. 3.20 for the top LLM). Despite anonymization, both LLM and human judges can distinguish human from LLM compositions above chance. Gemini-2.5-Pro identifies 39.4% of human poems by style or direct recall, while human judges are even more cautious about assigning human-authorship labels. Notably, LLM-generated poems tend toward detectable patterns of repetition and a lack of authorial individuality, as shown by higher n-gram redundancy and lower idiosyncrasy scores.
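The summary does not define how n-gram redundancy is computed. One common formulation, used here purely as an assumption, is the share of n-grams in a text that repeat an earlier n-gram, that is, one minus the ratio of distinct n-grams to total n-grams. The function name `ngram_redundancy` and the default `n=3` are hypothetical.

```python
import re


def ngram_redundancy(text: str, n: int = 3) -> float:
    """Return 1 - (distinct n-grams / total n-grams); 0.0 means no repetition."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # text shorter than n tokens
    return 1.0 - len(set(grams)) / len(grams)


print(ngram_redundancy("the rose is red the rose is red", n=2))
```

Under this definition, averaging the score over each model's poems and comparing it with the human average would give one rough view of the repetition pattern described above.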
Evaluation Validity
The reliability of POEMetric's LLM judge is supported by agreement statistics between Gemini-2.5-Pro and human experts (Observed Proportion Agreement 0.662; Quadratic Weighted Kappa 0.361) and a statistically significant rank correlation (Spearman ρ = 0.378). This supports the use of LLM-as-a-judge in large-scale automated evaluation, with the caveat that nuanced discrimination benefits from human oversight.
Implications and Future Directions
This study demonstrates that, while the current generation of LLMs can reliably produce poetry with high formal accuracy and thematic fidelity, they continue to underperform on the dimensions that fundamentally characterize poetic artistry: creativity, personal style, affective depth, and literary sophistication. The result exposes the current limitations of autoregressive or RL-finetuned transformer architectures in tasks requiring genuine originality or experiential grounding.
The proposed framework, and its accompanying annotated dataset and codebase, gives the field a robust testbed for benchmarking new models and approaches, whether through architectural innovation, retrieval-augmented generation, few-shot priming with idiosyncratic author data, or explicit optimization for creativity and emotional richness. Additionally, POEMetric enables future extension to free verse, cross-linguistic/cultural domains, and non-textual poetic modalities.
On the evaluation side, the success of Gemini-2.5-Pro as a digital judge for creative domains highlights the growing viability of LLMs as evaluation instruments for generative tasks, but it also underscores the necessity of maintaining rigorous human expert involvement for the most qualitative, open-ended facets.
Conclusion
"POEMetric: The Last Stanza of Humanity" (2604.03695) sets a new benchmark for comprehensive, multidimensional evaluation of machine-generated poetry. The study conclusively demonstrates the persistent creative gap between human and machine-authored verse, delineates the formal and creative capabilities of leading LLMs, and provides a validated protocol for future research on artistic text generation and evaluation. As LLMs continue to evolve, frameworks like POEMetric will be instrumental in tracking progress toward authentic machine creativity and in developing methods capable of producing not only formal but also deeply idiosyncratic and affectively resonant poetic works.