- The paper introduces CARP, a zero-shot evaluation mechanism that uses contrastive learning to align narrative and critique representations.
- It leverages a dataset of 1.3 million story-critique pairs and a dual text-encoder architecture to outperform autoregressive baselines, correlating more strongly with human evaluations.
- The methodology reduces reliance on fine-tuning while enhancing narrative coherence assessment and computational efficiency in story generation.
Cut the CARP: Fishing for Zero-Shot Story Evaluation
The paper "Cut the CARP: Fishing for zero-shot story evaluation" primarily focuses on addressing the challenges associated with evaluating machine-generated narrative text. With the proliferation of LLMs, the need for effective evaluation metrics and methodologies has intensified. Despite advancements, evaluating the logical coherence and narratological structure of computationally-generated stories remains complex.
CARP, or Contrastive Authoring and Reviewing Pairing, is introduced as a mechanism for zero-shot evaluation that leverages contrastive learning principles akin to those used in models like CLIP. CARP aligns story representations with critique representations, learning to match stories with their corresponding critiques. The method sidesteps the shortcomings of traditional approaches that rely heavily on fine-tuning and prompt engineering, and shows a stronger correlation with human evaluations.
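To make the training objective concrete, below is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of story and critique embeddings. The function name, encoder outputs, and temperature value are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def carp_style_contrastive_loss(story_emb: torch.Tensor,
                                critique_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style InfoNCE loss over a batch of story-critique pairs.

    story_emb, critique_emb: (batch, dim) outputs of the two text encoders.
    Matching pairs lie on the diagonal of the similarity matrix; every other
    entry in the batch serves as an in-batch negative.
    """
    story_emb = F.normalize(story_emb, dim=-1)
    critique_emb = F.normalize(critique_emb, dim=-1)

    logits = story_emb @ critique_emb.T / temperature      # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_s2c = F.cross_entropy(logits, targets)    # story -> correct critique
    loss_c2s = F.cross_entropy(logits.T, targets)  # critique -> correct story
    return (loss_s2c + loss_c2s) / 2
```

The symmetric form rewards the model both for retrieving the right critique given a story and for retrieving the right story given a critique, which is what aligns the two embedding spaces.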
At the core of this research is the Story-Critique Dataset, a corpus of 1.3 million story-critique pairs drawn from over 80,000 stories. The dataset is anonymized to protect author privacy, with preprocessing steps such as proper-noun removal and critique masking. This large-scale, aligned dataset is anticipated to be valuable for ongoing NLP research, offering a substantial foundation for algorithmic improvements in story evaluation.
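The paper does not publish its anonymization code, so the following is only a hypothetical sketch of how proper-noun masking of this kind is commonly done, here using spaCy's named-entity recognizer; the model name, entity labels, and `[NAME]` placeholder are all assumptions.

```python
import spacy

# Entity labels and model choice here are assumptions for illustration;
# the paper's actual anonymization pipeline is not published.
nlp = spacy.load("en_core_web_sm")

def mask_proper_nouns(text: str, mask: str = "[NAME]") -> str:
    """Replace person, organization, and place names with a neutral placeholder."""
    doc = nlp(text)
    out = text
    # Replace entities from the end of the string so earlier offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "ORG", "GPE", "LOC"}:
            out = out[:ent.start_char] + mask + out[ent.end_char:]
    return out

print(mask_proper_nouns("Alice handed the letter to Mr. Reyes in Boston."))
```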
The paper critiques existing story evaluation practice, which leans on costly human assessment because automated metrics often fail to capture narrative quality. By using a dual text-encoder architecture, CARP can embed and compare story passages with critiques, aiming to improve upon autoregressive baselines like GPT-J in both accuracy and computational efficiency. The dual-encoder design allows a more direct modeling of the relationship between a story and its critique, setting it apart from fine-tuning and prompt-engineering approaches, which the authors note perform poorly in zero-shot contexts.
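As an illustration of how such a dual encoder could be applied at inference time, here is a sketch of zero-shot scoring: a story embedding is compared against embeddings of natural-language critiques via cosine similarity. The encoder objects and example critiques are placeholders, not the released CARP interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_score(story: str, critiques: list,
                    story_encoder, critique_encoder) -> torch.Tensor:
    """Score a story against a set of natural-language critiques.

    story_encoder / critique_encoder are assumed to map a string to a 1-D
    embedding tensor (e.g., a pooled transformer output). The result is one
    cosine similarity per critique, usable as a zero-shot quality signal.
    """
    s = F.normalize(story_encoder(story), dim=-1)              # (dim,)
    c = torch.stack([critique_encoder(x) for x in critiques])  # (n, dim)
    c = F.normalize(c, dim=-1)
    return c @ s                                               # (n,)

# Hypothetical usage; the encoders are placeholders, not released CARP weights:
# sims = zero_shot_score(generated_story,
#                        ["The plot is coherent.", "The story contradicts itself."],
#                        story_encoder, critique_encoder)
```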
The evaluation section reports that CARP outperforms the autoregressive baselines at predicting human judgments of story quality. CARP models exceed baseline performance, and larger CARP variants correlate more strongly with human assessments.
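For readers unfamiliar with how such correlation is typically quantified, the snippet below computes a rank correlation between model scores and human ratings; the numbers are made up for illustration and are not the paper's results.

```python
from scipy.stats import spearmanr

# Made-up numbers for illustration only; these are not the paper's results.
carp_scores  = [0.81, 0.42, 0.67, 0.23, 0.90]   # model similarity score per story
human_scores = [4.5,  2.0,  3.5,  1.5,  5.0]    # mean human quality rating per story

rho, p_value = spearmanr(carp_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```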
This research presents significant implications for both theoretical and practical applications in NLP. Theoretically, it expands the potential applications of contrastive learning for narrative coherence evaluation. Practically, it suggests future pathways for the integration of automated story evaluation models into broader NLG pipelines, reducing reliance on human review and expediting the content generation process.
Looking forward, the paper hints at further developments in scaling CARP models and expanding the dataset to cover a broader range of critique types, which could eventually establish new benchmarks in natural language understanding and generation. This suggests a promising trajectory for reducing computational costs while improving the contextual understanding of storytelling AI systems.
In conclusion, the introduction of CARP marks a substantial progression in how computational systems can evaluate stories, leveraging linguistic critique data to offer enhanced, scalable, and cost-effective solutions, ultimately refining the narrative generation and assessment processes in AI.