Evaluation of Text Generation: A Survey (2006.14799v2)

Published 26 Jun 2020 in cs.CL and cs.LG

Abstract: The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made and the challenges still being faced, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models. We then present two examples for task-specific NLG evaluations for automatic text summarization and long text generation, and conclude the paper by proposing future research directions.

Overview of Evaluation Metrics for Natural Language Generation

The paper "Evaluation of Text Generation: A Survey" by Celikyilmaz, Clark, and Gao presents a comprehensive survey on the methods for evaluating natural language generation (NLG) systems. The paper categorizes evaluation methods into three key areas: human-centric evaluation metrics, automatic metrics not requiring training, and machine-learned metrics. The paper also focuses on two specific NLG tasks—automatic text summarization and long text generation—using these categorizations as frameworks. With these insights, the paper paves the way for future developments in the efficient evaluation of NLG models and proposes new directions in NLG evaluation.

Human-Centric Evaluation Metrics

Human evaluation remains the gold standard for assessing NLG systems because human judges can account for nuances of natural language that automatic metrics miss. However, the paper acknowledges the practical difficulties of comprehensive human evaluation: cost, time, variability in human judgment, and limited scalability. Human-centric evaluations are further divided into intrinsic evaluations, which ask judges to rate the quality of the generated text itself (for example its fluency and coherence), and extrinsic evaluations, which measure how useful the generated text is in a downstream application. This division highlights the complexity and fine-grained nature of human evaluation.
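
As a concrete illustration of how the variability concern is typically quantified, the following minimal sketch (not from the survey; the ratings and labels are purely illustrative) computes Cohen's kappa, a chance-corrected agreement statistic, for two annotators rating the same set of generated outputs:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if the two annotators' label distributions were independent.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Illustrative fluency judgments (1 = disfluent ... 3 = fluent) for five outputs.
print(cohens_kappa([3, 2, 3, 1, 2], [3, 2, 2, 1, 2]))  # ~0.69
```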

Untrained Automatic Metrics

Automatic metrics are widely employed for faster and more cost-effective evaluation of NLG systems. The paper categorizes untrained metrics into n-gram overlap metrics, distance-based metrics, content overlap metrics, and grammatical feature-based metrics, each of which targets a distinct aspect of generated text. Although they rely on word- or string-level matches and therefore often miss semantic equivalence, they provide efficient proxies for costly human evaluation. The paper emphasizes the continued relevance of metrics such as BLEU and ROUGE while encouraging the development of newer metrics that correlate better with human judgments.
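
To make the n-gram overlap family concrete, here is a minimal, self-contained sketch of a BLEU-style score: clipped n-gram precisions combined with a brevity penalty. It is a deliberate simplification of corpus-level BLEU (single reference, no smoothing), intended only to show the mechanics rather than reproduce any official implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-style score: geometric mean of clipped n-gram
    precisions times a brevity penalty (single reference, no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return brevity * geo_mean

cand = "the model generates fluent text".split()
ref = "the model produces fluent text".split()
print(round(bleu(cand, ref, max_n=2), 3))  # ~0.632
```

For sentences this short, unsmoothed higher-order precisions are often zero, which is why the example uses max_n=2; standard toolkits apply smoothing and aggregate statistics over a whole corpus.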

Machine-Learned Metrics

The evaluation of NLG systems increasingly leverages machine-learned models, particularly for tasks where many different outputs can be equally valid. These methods, often built on embeddings from pretrained models such as BERT, aim to approximate the judgments of human evaluators. The authors present machine-learned metrics as not only more scalable but also more strongly correlated with human evaluation, which makes them especially useful for open-ended and creative text generation tasks. Nonetheless, the paper warns of potential pitfalls, such as overfitting to the biases of the datasets these metrics are trained on.
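
The sketch below illustrates the embedding-based idea with a BERTScore-style greedy matching of candidate tokens to reference tokens by cosine similarity. For self-containment it substitutes a toy character-trigram bag for real contextual embeddings, so the numbers are purely illustrative; an actual machine-learned metric would plug in vectors from a pretrained model such as BERT:

```python
import math
from collections import Counter

def embed(token):
    """Toy stand-in for a contextual embedding: a bag of character trigrams."""
    padded = f"#{token}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def greedy_f1(candidate, reference):
    """BERTScore-style score: each token is matched to its most similar token
    on the other side (precision for candidate, recall for reference)."""
    sims = [[cosine(embed(c), embed(r)) for r in reference] for c in candidate]
    precision = sum(max(row) for row in sims) / len(candidate)
    recall = sum(max(sims[i][j] for i in range(len(candidate)))
                 for j in range(len(reference))) / len(reference)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(greedy_f1("the cat sat on the mat".split(),
                      "a cat was sitting on the mat".split()), 3))
```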

Task-Specific Evaluation: Summarization and Long-Text Generation

The authors examine task-specific evaluation through the lens of automatic text summarization and long text generation. For summarization, intrinsic methods evaluate content fidelity and linguistic quality, while extrinsic methods evaluate usefulness in downstream tasks; recognizing the task's complexity, the survey discusses dimensions such as coherence, factual consistency, and informativeness. For long text generation, attention is drawn to challenges such as maintaining discourse coherence and lexical cohesion over long spans. Current metrics handle these tasks poorly, so valid and reliable evaluation frequently requires bespoke methods and metrics.
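
For summarization, content overlap with reference summaries is still most often reported as ROUGE. A usage sketch is shown below, assuming the third-party rouge-score package (not part of the survey) is installed:

```python
# pip install rouge-score  (third-party package; assumed available)
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the survey reviews human, untrained automatic, and machine-learned evaluation metrics"
candidate = "the survey covers human and automatic evaluation metrics"
scores = scorer.score(reference, candidate)  # reference first, candidate second
for name, s in scores.items():
    print(name, round(s.precision, 3), round(s.recall, 3), round(s.fmeasure, 3))
```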

Implications and Future Directions

The paper presents foundational insights for improving NLG evaluation by aligning evaluation metrics with human expectations and judgments. Understanding the limitations of current metrics can guide future work toward more nuanced and comprehensive evaluation frameworks. The authors encourage further research into areas such as factual consistency, bias mitigation, and ethics in NLG evaluation to address evolving challenges in the field. As NLG technologies advance and their applications proliferate, robust, reliable, and scalable evaluation methods become increasingly essential. Overall, the survey emphasizes the dynamic nature of NLG evaluation and its critical role in driving progress in natural language generation research.

Authors (3)
  1. Asli Celikyilmaz (80 papers)
  2. Elizabeth Clark (16 papers)
  3. Jianfeng Gao (344 papers)
Citations (350)