Do Massively Pretrained Language Models Make Better Storytellers? (1909.10705v1)

Published 24 Sep 2019 in cs.CL, cs.AI, and cs.LG

Abstract: Large neural language models trained on massive amounts of text have emerged as a formidable strategy for Natural Language Understanding tasks. However, the strength of these models as Natural Language Generators is less clear. Though anecdotal evidence suggests that these models generate better quality text, there has been no detailed study characterizing their generation abilities. In this work, we compare the performance of an extensively pretrained model, OpenAI GPT2-117 (Radford et al., 2019), to a state-of-the-art neural story generation model (Fan et al., 2018). By evaluating the generated text across a wide variety of automatic metrics, we characterize the ways in which pretrained models do, and do not, make better storytellers. We find that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.

Citations (161)

Summary

  • The paper compares the storytelling abilities of the large GPT2-117 model to a specialized model using automatic metrics like coherence, diversity, and context relevance.
  • Despite its general architecture, the pretrained GPT2-117 shows stronger context conditioning and uses more rare words than the task-specific Fusion Model.
  • Findings reveal that decoding algorithms significantly impact text diversity and repetition in both models, emphasizing their critical role alongside pretraining for quality generation.

Overview of Massively Pretrained LLMs as Storytellers

The paper "Do Massively Pretrained LLMs Make Better Storytellers?" examines the storytelling capabilities of large-scale pretrained LLMs, specifically focusing on the OpenAI GPT2-117 model, and compares it to a specialized neural story generation model known as the Fusion Model. The investigation centers on the effects of extensive pretraining on open-ended natural language generation tasks, such as story generation, and evaluates various metrics related to storytelling competence.

Evaluation of Storytelling Competence

The paper conducts a comparative analysis using several automatic metrics to assess story generation capabilities. These metrics include text coherence, story-prompt relatedness, lexical and syntactic diversity, repetition, rare word usage, named entity usage, and overall text surprise. Each metric is intended to provide insight into the competence of the LLMs in generating text that is contextually relevant, coherent, diverse, and stylistically appropriate.
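As one illustration of this kind of automatic measure, lexical diversity is commonly quantified with a distinct-n style ratio of unique n-grams to total n-grams in the generated text. The sketch below is a generic example of that idea, not the paper's released evaluation code; the helper name `distinct_n` and the toy input are assumptions.

```python
def distinct_n(tokens, n):
    """Fraction of unique n-grams among all n-grams in a token sequence.

    Higher values indicate more lexically diverse text; heavily repetitive
    generations score close to zero.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Toy example: a deliberately repetitive "story"
story = "the cat sat on the mat and the cat sat on the mat".split()
print(distinct_n(story, 1))  # unigram diversity
print(distinct_n(story, 2))  # bigram diversity
```

Repetition, rare word usage, and related metrics follow the same pattern: simple corpus statistics computed over the generated stories and compared against human-written references.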

Key Findings

  1. Contextual Conditioning: GPT2-117 demonstrates a significantly stronger ability to condition on context compared to the Fusion Model, showing enhanced story-prompt relevance. This is surprising given that the Fusion Model employs a task-specific architecture designed to enhance prompt-story coherence.
  2. Coherence and Ordering: Both models are proficient in detecting reordered story events, but GPT2-117 exhibits greater sensitivity to event sequencing.
  3. Diversity and Repetition: When employing likelihood-maximizing decoding algorithms, such as top-k sampling with low k, both models produce repetitive and under-diverse text; this improves as k increases towards the vocabulary size (see the top-k sampling sketch after this list).
  4. Rare Word Usage: GPT2-117 generally uses more rare words than the Fusion Model, likely due to its pretraining on a diverse WebText corpus and its use of byte-pair encoding.
  5. Syntactic Complexity: Generation at low k values produces text with reduced syntactic sophistication, which normalizes at higher k values.
  6. Element of Surprise: High k values increase text novelty and complexity while maintaining lower overall probabilistic confidence, indicating that the models can produce surprising text that deviates from average expectations.
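To make the role of the decoding algorithm concrete, here is a minimal PyTorch sketch of top-k sampling as it is commonly implemented. This is a generic illustration of the technique discussed above, not the authors' code; the function name and the toy vocabulary size are assumptions.

```python
import torch

def sample_top_k(logits, k, temperature=1.0):
    """Sample one token id from the k highest-probability entries of `logits`.

    Small k concentrates probability mass on a few likely tokens (more
    repetitive, less diverse text); as k approaches the vocabulary size,
    this approaches ordinary sampling from the full distribution.
    """
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)        # keep the k best logits
    probs = torch.softmax(topk_vals, dim=-1)           # renormalize over them
    choice = torch.multinomial(probs, num_samples=1)   # sample within the top k
    return topk_idx[choice].item()

# Example with a toy vocabulary of 10 tokens
next_token = sample_top_k(torch.randn(10), k=5)
```

The k=1 limit is greedy decoding, which is where both models in the study degenerate into the most repetitive output; larger k trades likelihood for diversity.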

Implications and Future Directions

The findings suggest that while large-scale pretraining enhances certain aspects of storytelling, like context sensitivity and word rareness, it does not inherently resolve issues like repetition and diversity, which are primarily affected by the choice of decoding algorithm. These observations underline the significance of decoding strategies in text generation tasks and highlight the need for novel methodologies that impart logic, coherence, and commonsense reasoning to AI-generated narratives.

Future research should focus on developing mechanisms to improve coherence, incorporate world knowledge, and enhance reasoning abilities in LLMs. Such advancements could have practical implications for automated storytelling, dialogue systems, and other applications within natural language generation.

Additionally, the paper illustrates the limitations of existing text evaluation methodologies, pointing to the necessity for reliable metrics to objectively measure text coherence and quality. The authors release their evaluation code to facilitate further research in this area, aiming to encourage exploration into more effective pretraining, model architectures, and decoding algorithms.

Overall, the paper provides valuable insights into the storytelling potential of pretrained LLMs and informs future research directions in natural language processing and generation.