UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation (2009.07602v1)

Published 16 Sep 2020 in cs.CL

Abstract: Despite the success of existing referenced metrics (e.g., BLEU and MoverScore), they correlate poorly with human judgments for open-ended text generation including story or dialog generation because of the notorious one-to-many issue: there are many plausible outputs for the same input, which may differ substantially in literal or semantics from the limited number of given references. To alleviate this issue, we propose UNION, a learnable unreferenced metric for evaluating open-ended story generation, which measures the quality of a generated story without any reference. Built on top of BERT, UNION is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories. We propose an approach of constructing negative samples by mimicking the errors commonly observed in existing NLG models, including repeated plots, conflicting logic, and long-range incoherence. Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories, which correlates better with human judgments and is more generalizable than existing state-of-the-art metrics.

Authors (2)
  1. Jian Guan (65 papers)
  2. Minlie Huang (226 papers)
Citations (64)

Summary

  • The paper introduces UNION, a novel unreferenced metric that uses BERT and negative sampling to evaluate open-ended story generation without relying on reference texts, making it suitable for the one-to-many nature of story outputs.
  • A key contribution involves constructing negative samples by replicating common errors in generated stories (repetition, incoherence, conflicts) to train UNION to distinguish between human-quality and flawed outputs.
  • Extensive experiments demonstrate that UNION significantly outperforms traditional referenced metrics like BLEU and MoverScore in correlating with human judgments across diverse datasets and quality levels, providing a more reliable automatic evaluation method.

Evaluation of Open-ended Story Generation with the UNION Metric

The paper introduces UNION, an unreferenced metric designed specifically to evaluate the quality of open-ended story generation. Traditional referenced metrics such as BLEU and MoverScore often correlate poorly with human judgments on open-ended text generation because of the inherent one-to-many nature of the task: a single input admits many plausible outputs. UNION addresses this issue by evaluating story quality without requiring any references.

Key Contributions

  1. Learnable Unreferenced Metric: UNION fine-tunes BERT to distinguish human-written stories from constructed negative samples, together with an auxiliary task that recovers the perturbations applied to those samples. Because training does not depend on the outputs of any particular neural language generation (NLG) model, the metric generalizes across generators (a scoring sketch follows this list).
  2. Negative Sample Construction: The authors catalog the predominant errors of existing NLG models: repetitive content, incoherent plots, conflicting logic, and chaotic scenes. They replicate these errors with four perturbation operations, repetition, substitution, reordering, and negation alteration, applied to human-written stories to auto-generate negative samples (sketched below).
  3. Empirical Validation: Extensive experiments on two distinct datasets (ROCStories and WritingPrompts) demonstrate UNION's stronger correlation with human judgment compared to other metrics. Further experiments highlight UNION's robustness under dataset drift and quality drift.
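
To make the perturbation operations concrete, here is a minimal Python sketch of negative-sample construction covering the four error types. The function name, the unrelated-sentence placeholder, and the keyword-based negation rule are illustrative assumptions, not the authors' released code.

```python
import random

def make_negative_sample(sentences, rng=None):
    """Corrupt a human-written story (a list of sentences) with one of the
    four perturbation types named in the paper: repetition, substitution,
    reordering, and negation alteration. Illustrative sketch only."""
    rng = rng or random.Random(0)
    story = list(sentences)
    op = rng.choice(["repeat", "substitute", "reorder", "negate"])
    if op == "repeat":
        # Repeated plots: duplicate one sentence at another position.
        story.insert(rng.randrange(len(story) + 1), rng.choice(story))
    elif op == "substitute":
        # Long-range incoherence: splice in a sentence from another story.
        story[rng.randrange(len(story))] = "A sentence sampled from an unrelated story."
    elif op == "reorder":
        # Chaotic scenes: shuffle the sentence order.
        rng.shuffle(story)
    else:
        # Conflicting logic: flip a negation (toy keyword-based rule;
        # the paper's actual procedure is more careful than this).
        i = rng.randrange(len(story))
        s = story[i]
        story[i] = s.replace(" not ", " ") if " not " in s else s.replace(" is ", " is not ", 1)
    return story

print(make_negative_sample([
    "Tom planted a garden.",
    "He watered it every day.",
    "Soon the flowers bloomed.",
]))
```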

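And here is a minimal sketch of the discriminative half of a UNION-style metric: a BERT sequence classifier whose predicted probability of "human-written" serves as the quality score. The checkpoint below is untrained for brevity (in practice the classifier is fine-tuned on human stories versus the negative samples above), and the auxiliary reconstruction objective is omitted.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# num_labels=2: class 0 = negative sample, class 1 = human-written
# (the label assignment here is an assumption for illustration).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def union_style_score(story: str) -> float:
    """Return P(human-written) for a story: the discriminative part of a
    UNION-style unreferenced metric."""
    inputs = tokenizer(story, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(union_style_score("Tom planted a garden. He watered it every day."))
```
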
Evaluation and Results

UNION's effectiveness is substantiated by its high correlation with human judgments, reported with Pearson's, Spearman's, and Kendall's correlation coefficients. Compared with existing metrics, UNION remains reliable under dataset drift, preserving evaluative integrity across datasets that differ in story length and topic, and under quality drift, scoring samples consistently across a wide range of quality levels.
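
For readers who want to reproduce this style of analysis, the snippet below computes all three coefficients with scipy.stats; the metric and human scores are hypothetical placeholders, not numbers from the paper.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.91, 0.34, 0.77, 0.12, 0.58]  # hypothetical metric outputs
human_scores = [4.5, 2.0, 4.0, 1.5, 3.0]        # hypothetical human ratings

for name, fn in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    stat, p = fn(metric_scores, human_scores)
    print(f"{name}: r={stat:.3f} (p={p:.3f})")
```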

Practical and Theoretical Implications

UNION's ability to evaluate story quality accurately without references is a significant advance for story generation tasks, offering an automatic evaluation method that aligns more closely with human perception. Its generalizability is critical for practical applications involving diverse datasets or models, where training and test distributions can differ markedly.

Future Directions

The UNION framework sets a precedent for building robust unreferenced metrics that extend beyond story generation to dialog systems and broader NLG tasks. Potential improvements include refining the negative-sample construction techniques and incorporating external knowledge bases to judge semantic coherence and logical consistency with greater precision.

In summary, UNION marks a substantial development in evaluating open-ended story generation, directly confronting the limitations of referenced metrics and offering a viable path toward high-quality automatic assessment.
