- The paper critically examines how summarisation metrics are meta-evaluated, comparing automatic measures against human judgements to surface performance discrepancies.
- It highlights the gap between system-level and summary-level evaluation, stressing the need for standardised annotation practices and more diverse benchmarks.
- The study advocates incorporating domain-specific quality dimensions and user-centric evaluation to improve the faithfulness and practical applicability of summarisation evaluation in NLP.
Introduction
The evaluation of summarisation metrics is pivotal to the development and assessment of NLP systems. This paper examines current meta-evaluation practices for summarisation metrics, identifies prevailing trends, and advocates for more diverse benchmarks and methodologies. Notably, it emphasises the growing focus on faithfulness in summary generation and the need for benchmarks that accurately reflect domain-specific requirements and the ways users interact with summarised content.
Summarisation Evaluation Metrics
Summarisation metrics primarily fall into categories such as summary-only, similarity-based, entailment-based, QA-based, learning-based, and LLM-based metrics. Each category uses different parts of the input (e.g., source text, generated summary) and a different methodology to assess summary quality. These metrics are central to evaluating summarisation systems for properties such as clarity, coherence, and factual consistency; a minimal sketch of two of these families appears below.
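As a concrete, deliberately simplified illustration, the sketch below scores one generated summary with ROUGE-L (a similarity-based metric) and BERTScore (a learning-based one) via the HuggingFace `evaluate` library. The example texts and the tooling choice are assumptions made here for illustration, not something prescribed by the paper.

```python
# Minimal sketch of two metric families: lexical-overlap similarity (ROUGE-L)
# and embedding-based learned similarity (BERTScore). Texts are invented.
import evaluate

reference = "The council approved an expansion of the city's bike lanes."
generated = "City council votes to expand bike lanes across the city."

# Similarity-based: n-gram / longest-common-subsequence overlap with the reference.
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=[generated], references=[reference])
print("ROUGE-L:", rouge_scores["rougeL"])

# Learning-based: contextual-embedding similarity between candidate and reference.
bertscore = evaluate.load("bertscore")
bs = bertscore.compute(predictions=[generated], references=[reference], lang="en")
print("BERTScore F1:", bs["f1"][0])
```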
Meta-evaluation assesses the reliability and validity of summarisation metrics by comparing them against human judgements. Typically, this involves computing the correlation between automatic metric scores and human judgements over a set of generated summaries. Existing findings highlight the discrepancy between system-level and summary-level correlations, underscoring how difficult it is for automatic evaluation to faithfully reflect human judgement.
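The system-level versus summary-level distinction can be made concrete with a short sketch. It assumes metric scores and human ratings for several systems on a shared document set; all numbers are synthetic placeholders, and the summary-level variant shown (per-document correlation across systems) is one common convention among several.

```python
# Sketch of summary-level vs system-level meta-evaluation, assuming metric
# scores and human ratings for S systems on the same D documents (synthetic data).
import numpy as np
from scipy.stats import kendalltau, pearsonr

rng = np.random.default_rng(0)
num_systems, num_docs = 5, 100
metric = rng.random((num_systems, num_docs))                          # automatic scores per summary
human = metric + 0.3 * rng.standard_normal((num_systems, num_docs))   # noisy "human" ratings

# Summary-level (one common variant): for each document, correlate the metric's
# ranking of systems with the human ranking, then average over documents.
taus = []
for d in range(num_docs):
    tau, _ = kendalltau(metric[:, d], human[:, d])
    taus.append(tau)
summary_level = float(np.mean(taus))

# System-level: average each system's scores over documents first,
# then correlate the per-system means.
r, _ = pearsonr(metric.mean(axis=1), human.mean(axis=1))

print(f"summary-level Kendall tau: {summary_level:.3f}")
print(f"system-level Pearson r:   {r:.3f}")
```

A metric can look strong at the system level (ranking whole systems) while remaining unreliable at the summary level (judging individual outputs), which is exactly the gap the paper points to.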
Figure 1: The distribution of consistency scores, measured using WeCheck, between source texts and reference summaries across different datasets.
Choosing Data to Annotate
Current benchmarks rely heavily on news and dialogue summarisation datasets, which may limit how well evaluation metrics generalise to other domains and raises concerns about their adaptability and domain-specific effectiveness. The paper calls for benchmarks that span a wider range of domains and summarisation constraints to improve the robustness and applicability of these metrics.

Figure 2: Evaluation results using ROUGE-L, where higher scores indicate more overlap between generated and reference summaries.
Defining Quality Dimensions
While recent efforts predominantly target intrinsic quality dimensions such as coherence and faithfulness, extrinsic evaluation built around user-centric quality dimensions remains underexplored. Such extrinsic evaluation acknowledges the variability in user preferences and application scenarios, which is essential for developing user-aligned summarisation systems.
Collecting Human Judgements
The efficacy of summarisation metrics hinges largely on the quality of the human annotations used for meta-evaluation. The paper highlights the need for clearer annotation guidelines and for involving end users in the annotation process, which would better reflect real-world applications and expectations. It also suggests standardising practices to ensure consistency and reliability across evaluations.
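As one illustration of how annotation reliability can be quantified, the sketch below computes a weighted Cohen's kappa between two hypothetical annotators rating the same summaries on a 1-5 scale. The ratings and the choice of agreement statistic are assumptions for this example, not taken from the paper.

```python
# Illustrative check of annotation reliability with weighted Cohen's kappa,
# assuming two annotators rated the same summaries on a 1-5 ordinal scale.
# The ratings below are invented placeholders.
from sklearn.metrics import cohen_kappa_score

annotator_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
annotator_b = [5, 3, 4, 2, 4, 5, 2, 4, 3, 1]

# Quadratic weights penalise large disagreements more than near-misses,
# which suits ordinal quality scales.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"weighted kappa: {kappa:.3f}")  # values near 1 indicate strong agreement
```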
Comparing Automatic Metrics Against Human Judgements
Assessing the correlation between automatic metrics and human judgements remains a key component of meta-evaluation. However, the paper argues that a high correlation alone may not suffice to certify a metric's effectiveness. Factors such as the metric's ability to distinguish between systems of similar quality, the statistical power of the evaluation, and the influence of reference summaries also play crucial roles in a comprehensive assessment.
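To make the point about statistical power concrete, the sketch below applies a paired bootstrap over documents and asks how reliably a metric separates two systems of similar average quality. The scores are synthetic and the resampling recipe is a standard technique, not the paper's own protocol.

```python
# Paired bootstrap sketch: can a metric reliably separate two systems whose
# average scores differ only slightly? Scores below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(42)
num_docs = 200
scores_a = rng.normal(loc=0.42, scale=0.10, size=num_docs)  # metric scores, system A
scores_b = rng.normal(loc=0.40, scale=0.10, size=num_docs)  # metric scores, system B

observed_diff = scores_a.mean() - scores_b.mean()

# Resample documents with replacement and count how often the sign of the mean
# difference flips; a high flip rate means the metric lacks the power to
# distinguish the two systems on this test set.
flips = 0
n_boot = 10_000
for _ in range(n_boot):
    idx = rng.integers(0, num_docs, size=num_docs)
    diff = scores_a[idx].mean() - scores_b[idx].mean()
    if np.sign(diff) != np.sign(observed_diff):
        flips += 1

print(f"observed mean difference: {observed_diff:.4f}")
print(f"sign-flip rate over {n_boot} bootstrap samples: {flips / n_boot:.3f}")
```

A metric with a decent correlation to human judgements can still fail this kind of test on closely matched systems, which is why correlation alone is an incomplete certificate.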
Figure 3: Evaluation results using WeCheck, where higher scores indicate higher consistency levels between source texts and generated summaries.
Conclusion
The paper advocates for the development of more diverse and standardised benchmarks that cater to varied domains and capture the nuances of user-centric quality dimensions. By refining evaluation methodologies, including data selection, quality dimension definition, and annotation practices, the field can advance towards more reliable and applicable summarisation metrics. This progress is vital for fostering robust systems that align with users' diverse and evolving needs.
In summary, the paper posits that improving the generality and domain-specific applicability of summarisation metrics, together with standardised meta-evaluation practices, will lead to more effective and user-aligned summarisation systems in NLP.