
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis (2212.10179v1)

Published 20 Dec 2022 in cs.CL

Abstract: State-of-the-art LLM-based automatic metrics such as BARTScore, which benefit from large-scale contextualized pre-training, have been successfully applied to a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text. Recent studies show that considering both major errors (e.g., mistranslated tokens) and minor errors (e.g., imperfections in fluency) produces high-quality human judgments. This inspires us to approach the ultimate goal of evaluation metrics, human-like evaluation, through automatic error analysis. To this end, we augment BARTScore with human-like error analysis strategies, yielding BARTScore++, whose final score combines the evaluations of both major and minor errors. Experimental results show that BARTScore++ consistently improves vanilla BARTScore and outperforms existing top-scoring metrics in 20 of 25 test settings. We hope our technique can also be extended to other pre-trained model-based metrics, and we will release our code and scripts to facilitate the community.
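
To make the scoring scheme described in the abstract concrete, here is a minimal sketch of a BARTScore-style log-likelihood scorer together with a hypothetical weighted combination of major- and minor-error components. This is not the authors' released implementation: the checkpoint name, the `alpha` weight, and the use of a "major-error-corrected" hypothesis are illustrative assumptions.

```python
# Sketch only: a BARTScore-style scorer plus an assumed major/minor combination.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()


def bartscore(source: str, target: str) -> float:
    """Average log-likelihood of generating `target` given `source` (vanilla BARTScore style)."""
    with torch.no_grad():
        inputs = tokenizer(source, return_tensors="pt", truncation=True)
        labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
        # model(...).loss is the mean token-level negative log-likelihood
        loss = model(**inputs, labels=labels).loss
    return -loss.item()


def bartscore_pp(source: str, hyp_major_fixed: str, hyp: str, alpha: float = 0.5) -> float:
    """Hypothetical BARTScore++-style score: evaluate major and minor errors
    separately and mix them with weight `alpha` (an assumed formulation,
    not the paper's exact one)."""
    # `hyp_major_fixed` stands for the hypothesis with major errors corrected,
    # so the remaining gap to `hyp` reflects minor (fluency-level) imperfections.
    major_component = bartscore(source, hyp_major_fixed)
    minor_component = bartscore(hyp_major_fixed, hyp)
    return alpha * major_component + (1 - alpha) * minor_component
```

In this sketch, a higher (less negative) score indicates a better hypothesis; how major errors are detected and corrected, and how the two components are actually weighted, is specified in the paper rather than here.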

Authors (6)
  1. Qingyu Lu (6 papers)
  2. Liang Ding (159 papers)
  3. Liping Xie (4 papers)
  4. Kanjian Zhang (8 papers)
  5. Derek F. Wong (69 papers)
  6. Dacheng Tao (829 papers)
Citations (12)
