- The paper introduces UNION, a novel unreferenced metric that uses BERT and negative sampling to evaluate open-ended story generation without relying on reference texts, making it suitable for the one-to-many nature of story outputs.
- A key contribution involves constructing negative samples by replicating common errors in generated stories (repetition, incoherence, conflicts) to train UNION to distinguish between human-quality and flawed outputs.
- Extensive experiments demonstrate that UNION significantly outperforms traditional referenced metrics like BLEU and MoverScore in correlating with human judgments across diverse datasets and quality levels, providing a more reliable automatic evaluation method.
Evaluating Open-ended Story Generation with the UNION Metric
The paper introduces UNION, an unreferenced metric designed specifically to evaluate the quality of open-ended story generation. Traditional referenced metrics, such as BLEU and MoverScore, often correlate poorly with human judgments on open-ended text generation because a single input admits many plausible outputs. The paper addresses this issue by proposing UNION, which evaluates story generation quality without requiring reference texts.
Key Contributions
- Learnable Unreferenced Metric: UNION leverages BERT to distinguish human-written stories from automatically constructed negative samples. Because training never relies on the outputs of any particular natural language generation (NLG) model, the metric generalizes across models; an auxiliary reconstruction task further refines its evaluation capability (see the discriminator sketch after this list).
- Negative Sample Construction: The authors catalogue the predominant errors of existing NLG models: repetitive content, incoherent plots, conflicting logic, and chaotic scenes. They replicate these flaws by perturbing human-written stories with repetition, substitution, reordering, and negation alteration, which automatically yields negative training samples (sketched below).
- Empirical Validation: Extensive experiments on two distinct datasets (ROCStories and WritingPrompts) show that UNION correlates with human judgment more strongly than competing metrics. Additional experiments highlight UNION's robustness under dataset drift and quality drift.
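To make the negative-sampling idea concrete, here is a minimal Python sketch of the four perturbation types applied to a story represented as a list of sentences. The helper names and heuristics (e.g., the tiny negation-word list) are illustrative assumptions, not the authors' exact procedure.

```python
import random

# Illustrative sketch of UNION-style negative sample construction; helper names
# and heuristics are assumptions, not the paper's exact procedure.

NEGATION_WORDS = {"not", "never", "no"}  # toy list for illustration


def repeat_sentence(sentences):
    """Repetition: duplicate a random sentence to mimic repetitive plots."""
    i = random.randrange(len(sentences))
    return sentences[:i + 1] + [sentences[i]] + sentences[i + 1:]


def substitute_sentence(sentences, corpus):
    """Substitution: swap in a sentence from another story, breaking coherence."""
    i = random.randrange(len(sentences))
    foreign = random.choice(random.choice(corpus))
    return sentences[:i] + [foreign] + sentences[i + 1:]


def reorder_sentences(sentences):
    """Reordering: shuffle sentence order to disrupt temporal/causal structure."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled


def alter_negation(sentences):
    """Negation alteration: add or drop a negation word to create logical conflicts."""
    i = random.randrange(len(sentences))
    tokens = sentences[i].split()
    if any(t.lower() in NEGATION_WORDS for t in tokens):
        tokens = [t for t in tokens if t.lower() not in NEGATION_WORDS]
    else:
        tokens.insert(min(1, len(tokens)), "not")
    return sentences[:i] + [" ".join(tokens)] + sentences[i + 1:]


def make_negative(sentences, corpus):
    """Turn a human-written story into a flawed negative sample."""
    perturbation = random.choice([
        repeat_sentence,
        lambda s: substitute_sentence(s, corpus),
        reorder_sentences,
        alter_negation,
    ])
    return perturbation(sentences)
```

The learnable metric itself can then be a standard BERT classifier trained to separate human-written stories (label 1) from the constructed negatives (label 0), with the positive-class probability used as the quality score at evaluation time. The sketch below uses the Hugging Face transformers API and omits the paper's auxiliary reconstruction objective, batching, and other training details.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# Sketch of the discriminator: BERT fine-tuned to separate human-written stories
# from constructed negatives. The auxiliary reconstruction task is omitted here.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)


def training_step(stories, labels):
    """One gradient step: label 1 = human-written, label 0 = constructed negative."""
    batch = tokenizer(stories, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    loss = model(**batch, labels=torch.tensor(labels)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


def union_style_score(story):
    """Score a generated story: probability of the 'human-written' class."""
    batch = tokenizer(story, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**batch).logits, dim=-1)
    return probs[0, 1].item()
```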
Evaluation and Results
UNION's effectiveness is substantiated by its high correlation with human judgments across experiments: Pearson's, Spearman's, and Kendall's correlation coefficients all show it to be more reliable and robust than existing metrics. UNION also handles dataset drift, maintaining evaluative reliability across datasets that differ markedly in story length and topic, as well as quality drift, producing consistent evaluations over samples of varying quality. A minimal example of how such metric-human correlations are computed follows.
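The sketch below shows the standard way these three coefficients are computed with scipy; the scores are made-up placeholders for illustration only, not numbers from the paper.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Made-up placeholder scores for illustration only (not results from the paper).
human_scores  = [0.20, 0.55, 0.90, 0.40, 0.70]   # averaged human ratings per story
metric_scores = [0.30, 0.45, 0.80, 0.50, 0.60]   # UNION (or BLEU, MoverScore, ...) scores

print(f"Pearson r:    {pearsonr(human_scores, metric_scores)[0]:.3f}")
print(f"Spearman rho: {spearmanr(human_scores, metric_scores)[0]:.3f}")
print(f"Kendall tau:  {kendalltau(human_scores, metric_scores)[0]:.3f}")
```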
Practical and Theoretical Implications
UNION's ability to evaluate story quality accurately without references is a significant advance for story generation tasks, offering an automatic evaluation method that aligns more closely with human perception. The metric's generalizability is critical for practical applications involving diverse datasets or models, where training and test distributions can differ substantially.
Future Directions
The UNION framework sets a precedent for building robust unreferenced metrics beyond story generation, for example in dialogue systems and other NLG tasks. Potential improvements include refining the negative sample construction techniques and incorporating external knowledge bases to capture semantic coherence and logical consistency more precisely.
In summary, UNION marks a substantial step forward for evaluating open-ended story generation, directly confronting the limitations of referenced metrics and offering a viable path toward high-quality automatic assessment.