
Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets (2406.18906v3)

Published 27 Jun 2024 in cs.CL

Abstract: LLMs can now generate and recognize poetry. But what do LLMs really know about poetry? We develop a task to evaluate how well LLMs recognize one aspect of English-language poetry--poetic form--which captures many different poetic features, including rhyme scheme, meter, and word or line repetition. By using a benchmark dataset of over 4.1k human expert-annotated poems, we show that state-of-the-art LLMs can successfully identify both common and uncommon fixed poetic forms--such as sonnets, sestinas, and pantoums--with surprisingly high accuracy. However, performance varies significantly by poetic form; the models struggle to identify unfixed poetic forms, especially those based on topic or visual features. We additionally measure how many poems from our benchmark dataset are present in popular pretraining datasets or memorized by GPT-4, finding that pretraining presence and memorization may improve performance on this task, but results are inconclusive. We release a benchmark evaluation dataset with 1.4k public domain poems and form annotations, results of memorization experiments and data audits, and code.

Evaluation of Poetic Forms by LLMs

The paper "Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets" by Melanie Walsh, Anna Preus, and Maria Antoniak provides a rigorous examination of how well contemporary LLMs can identify various poetic forms. The authors introduce a benchmark designed to evaluate LLMs' capabilities in recognizing more than 20 fixed and unfixed poetic forms in the English language and analyze the implications for NLP, digital humanities, and cultural heritage.

Summary of the Study

The paper focuses on assessing LLMs' abilities to categorize poems by form, a task requiring understanding complex features such as rhyme schemes, meter, and repetition. The poetic forms considered include common ones like sonnets and haikus, as well as more intricate forms like sestinas and pantoums.

Methodology

The researchers used a diverse set of poems sourced from reputable institutions such as the Poetry Foundation and the Academy of American Poets. Additionally, they manually digitized a selection of poetry books. The resulting dataset comprises 4,197 poems annotated by human experts.

They evaluated multiple LLMs, including GPT-3.5 Turbo, GPT-4, GPT-4o, Claude 3 Sonnet, Llama 3, and Mixtral 8x22B, using several zero-shot prompt types (e.g., poem text only, title and author only, first line only). The models' predictions were scored against the human expert annotations.
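To make the evaluation setup concrete, the following is a minimal sketch of a zero-shot form-classification query using the OpenAI Python client. The prompt wording, candidate form list, and temperature setting are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch of a zero-shot form-classification query, assuming the
# OpenAI Python client (>= 1.0). The prompt and form list are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FORMS = ["sonnet", "haiku", "sestina", "pantoum", "villanelle", "elegy"]

def classify_form(poem_text: str, model: str = "gpt-4o") -> str:
    """Ask the model to name the poem's form from a fixed candidate list."""
    prompt = (
        "What poetic form is the following poem? "
        f"Answer with one of: {', '.join(FORMS)}.\n\n{poem_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep outputs near-deterministic for evaluation
    )
    return response.choices[0].message.content.strip().lower()
```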

Findings

The results indicate that the LLMs generally perform well on common fixed forms like sonnets and haikus, achieving high F1 scores (near or above 0.9 for most models) when provided with the text of the poem. However, performance declines on more complex or less common forms like sestinas and pantoums.

  • Fixed Forms: GPT-4 and GPT-4o showed particular strength on forms defined by intricate repetition, such as sestinas (F1 = 0.87 and 0.73, respectively) and pantoums (F1 = 0.81 and 0.82); a per-form scoring sketch follows this list.
  • Unfixed Forms: The models struggled with forms based on topic (e.g., elegies, ars poetica) and visual features (e.g., concrete poetry, prose poems).
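Per-form F1 scores like those above can be computed by treating each form as a one-vs-rest binary label. Below is a minimal sketch using scikit-learn with toy labels; it is not the paper's released evaluation code.

```python
# Per-form F1 as a one-vs-rest comparison of expert labels vs. model
# predictions. The labels here are toy examples, not the benchmark data.
from sklearn.metrics import f1_score

gold = ["sonnet", "sestina", "haiku", "sestina", "pantoum"]
pred = ["sonnet", "sestina", "haiku", "pantoum", "pantoum"]

for form in sorted(set(gold)):
    y_true = [int(g == form) for g in gold]
    y_pred = [int(p == form) for p in pred]
    print(f"{form}: F1 = {f1_score(y_true, y_pred):.2f}")
```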

Analysis of Pretraining Data

The authors also audited popular pretraining datasets (e.g., Dolma) for the presence of the benchmark poems and probed GPT-4 for memorization of them. They found substantial memorization of poetry in GPT-4's output, and the Common Crawl and C4 datasets contained significant percentages of the benchmark poems. Pretraining presence and memorization may improve form-recognition performance, though the authors report these results as inconclusive; such contamination complicates the construction of unbiased benchmarks.
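A common way to probe memorization in this spirit is to prompt the model with a poem's opening lines and measure how closely its continuation matches the true text. The sketch below assumes the OpenAI client; the similarity metric, threshold, and prompt phrasing are illustrative assumptions rather than the authors' exact method.

```python
# Simplified memorization probe: give the model a poem's opening lines and
# check how closely its continuation matches the real text. The 0.8
# similarity threshold and prompt phrasing are assumptions for illustration.
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()

def is_memorized(poem_lines: list[str], n_prompt_lines: int = 2,
                 model: str = "gpt-4", threshold: float = 0.8) -> bool:
    prompt_part = "\n".join(poem_lines[:n_prompt_lines])
    true_rest = "\n".join(poem_lines[n_prompt_lines:])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Continue this poem exactly:\n{prompt_part}"}],
        temperature=0,
    )
    completion = response.choices[0].message.content
    # Character-level similarity between the model's continuation and the
    # poem's actual remaining lines.
    similarity = SequenceMatcher(None, completion, true_rest).ratio()
    return similarity >= threshold
```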

Implications and Future Directions

NLP and Model Evaluation

This paper underscores the need for nuanced benchmarks that account for the complexities of creative genres like poetry. The observed performance differences between poetic forms highlight the varying capabilities of modern LLMs and their reliance on the structure and frequency of training data.

Digital Humanities and Cultural Analytics

For digital humanities scholars, the research shows the potential and current limitations of using LLMs for literary analysis. Automated form detection could notably enhance the discoverability of poetic texts in digital archives, aiding research and education.

Cultural Heritage Collections

For libraries and cultural institutions, these findings suggest that integrating LLM-based tools could facilitate the cataloging of large poetry collections. However, careful attention to the limitations and biases of such models is crucial.

Conclusion

The paper offers a thorough examination of how well LLMs understand and categorize English poetic forms. While the results are promising, especially for commonly studied forms, they also highlight significant gaps in the models' capabilities for less frequent and more complex forms. Future research should explore multi-label classifications and include a broader range of poetic traditions and languages to build more comprehensive evaluation frameworks. This research bridges the fields of NLP and digital humanities, opening avenues for enhanced literary analysis and text categorization with advanced AI tools.

In addition to their technical contributions, the authors call for more interdisciplinary collaboration between computer scientists and literary scholars to develop nuanced evaluation tools that respect the diversity and complexity inherent in poetic forms.
