- The paper introduces the GrAImes protocol, a systematic framework for assessing literary quality, coherence, and marketability in Spanish microfiction.
- It demonstrates that while ChatGPT-3.5 produces structurally sound texts with strong commercial appeal, it lacks the interpretive depth of human-authored microfiction.
- The study emphasizes the need to refine evaluation methods by integrating diverse expert perspectives and richer qualitative metrics to better capture computational creativity.
Evaluation of AI-Generated Spanish Microfiction: Insights from the GrAImes Protocol
The paper "Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction" presents an inquiry into the literary capabilities of AI in generating microfictions, focusing on a systematic evaluation method called GrAImes. The study's objective is to discern whether AI-generated texts can reach the literary depth akin to esteemed human authors such as Jorge Luis Borges, especially in Spanish microfiction.
Overview of AI and Human Microfiction Evaluation
Both human-written and AI-generated microfictions were assessed using the GrAImes protocol, a framework grounded in literary theory. The evaluation covered literary, technical, and commercial aspects via a questionnaire with five open-ended and ten Likert-scale questions, answered by two groups of evaluators: literary experts and literature enthusiasts. Their responses were analyzed for thematic coherence, interpretive depth, narrative plausibility, originality, and potential marketability.
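To illustrate how such a protocol can be represented and aggregated in practice, the following minimal Python sketch models one evaluation record. The class name, item names, and dimension groupings are hypothetical stand-ins, not the protocol's actual wording.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class MicrofictionEvaluation:
    """One evaluator's GrAImes-style response (illustrative structure only)."""
    evaluator_group: str            # "expert" or "enthusiast"
    open_answers: dict[str, str]    # the five open-ended questions
    likert_scores: dict[str, int]   # the ten Likert items, rated 1-5

    def dimension_score(self, items: list[str]) -> float:
        """Mean Likert score over the items that make up one dimension."""
        return mean(self.likert_scores[item] for item in items)

# Hypothetical usage: score a "commercial" dimension for one response.
response = MicrofictionEvaluation(
    evaluator_group="expert",
    open_answers={"interpretation": "a meditation on memory and loss"},
    likert_scores={"coherence": 4, "originality": 3,
                   "marketability": 5, "engagement": 4},
)
print(response.dimension_score(["marketability", "engagement"]))  # 4.5
```

Grouping Likert items into named dimensions like this makes it straightforward to compare evaluator groups on each aspect separately, which mirrors how the study contrasts literary, technical, and commercial judgments.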
Evaluation Results
The study comprises two experiments, one evaluating human-written texts and one evaluating AI-generated texts. Literary experts generally rated human-authored microfictions by well-established authors higher in coherence and creativity. Among the AI systems, texts generated by ChatGPT-3.5 showed stronger commercial potential than those created by the Monterroso model.
Noteworthy Results:
- Human-written microfictions were often recognized for their thematic complexity and interpretive richness, with the most favorable evaluations going to experienced authors.
- AI-generated texts from ChatGPT-3.5, although structurally coherent, showed limited literary originality; they were, however, favored on commercial appeal and reader engagement.
- Statistical analyses indicated strong agreement among literary experts when evaluating experienced authors, reflected in higher Cronbach's alpha reliability scores (see the sketch after this list).
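Cronbach's alpha is a standard internal-consistency statistic; for inter-rater agreement it is computed by treating the raters as the "items". The paper does not publish its computation code, so the snippet below is a minimal sketch of the standard formula with made-up ratings for illustration:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects x n_items) score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))

    For inter-rater agreement, columns are raters and rows are the
    texts (or questionnaire items) being rated.
    """
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance per column
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of row sums
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Made-up example: six microfictions rated 1-5 by four experts.
ratings = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [3, 4, 3, 3],
])
print(f"alpha = {cronbach_alpha(ratings):.3f}")  # strong agreement -> alpha near 1
```

Values above roughly 0.8 are conventionally read as strong agreement, which is the sense in which the study reports higher reliability for evaluations of experienced authors.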
Implications
The paper's findings underscore the differential evaluative criteria applied by literary experts versus literature enthusiasts, the former focusing on artistic depth and technical execution while the latter prioritizes accessibility and enjoyment. The experiments suggest AI can approximate narrative coherence and engage readers, although achieving the interpretive depth characteristic of human literature remains challenging.
The study proposes continued refinement of evaluation frameworks like GrAImes to address ambiguities in literary quality assessment and to integrate richer qualitative metrics. Moreover, interpretive disparities among evaluators highlight the need for diverse evaluative perspectives to capture the multifaceted nature of literary texts.
Future Directions
The protocol’s applicability across literary genres and languages requires further exploration, including incorporating AI self-assessment mechanisms for a more comprehensive evaluative perspective. The research advances the discourse on computational creativity, inviting future studies to refine methodologies, particularly concerning aesthetic assessments in AI-generated literature.
This paper contributes to computational literature by delineating evaluation criteria that bridge literary theory and algorithmic narrative generation, emphasizing that while AI offers innovative narrative possibilities, true literary craft remains a domain requiring nuanced human insight and creativity.