LUNA: A Framework for Language Understanding and Naturalness Assessment (2401.04522v1)

Published 9 Jan 2024 in cs.CL

Abstract: The evaluation of Natural Language Generation (NLG) models has attracted increasing attention, spurring the development of metrics that assess various aspects of generated text. LUNA addresses this challenge by introducing a unified interface for 20 NLG evaluation metrics. These metrics are categorized by their reference-dependence and by the type of text representation they employ, ranging from string-based n-gram overlap to static embeddings and pre-trained language models. LUNA's straightforward design makes it easy to extend with novel metrics, requiring just a few lines of code, and it offers a user-friendly tool for evaluating generated texts.
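
The abstract highlights two design points: a single interface shared by reference-based and reference-free metrics, and extensibility "with just a few lines of code." Since the paper's actual API is not reproduced on this page, the sketch below only illustrates that design pattern; the `Metric` base class, the `requires_references` flag, the `evaluate` method, and the toy `UnigramOverlap` metric are assumed names for this example, not LUNA's real identifiers.

```python
# Illustrative sketch of a unified NLG-metric interface in the spirit of LUNA.
# All names here (Metric, requires_references, evaluate, UnigramOverlap) are
# assumptions made for this example, not LUNA's actual API.
from abc import ABC, abstractmethod
from typing import List, Optional


class Metric(ABC):
    """Shared contract: score hypotheses, optionally against references."""

    requires_references: bool = True  # reference-based vs. reference-free

    @abstractmethod
    def evaluate(self, hypotheses: List[str],
                 references: Optional[List[str]] = None) -> List[float]:
        """Return one quality score per hypothesis."""


class UnigramOverlap(Metric):
    """Toy string-based metric: fraction of hypothesis tokens found in the reference."""

    def evaluate(self, hypotheses, references=None):
        if references is None:
            raise ValueError("UnigramOverlap is reference-based")
        scores = []
        for hyp, ref in zip(hypotheses, references):
            hyp_tokens, ref_tokens = hyp.split(), set(ref.split())
            matched = sum(tok in ref_tokens for tok in hyp_tokens)
            scores.append(matched / len(hyp_tokens) if hyp_tokens else 0.0)
        return scores


if __name__ == "__main__":
    metric = UnigramOverlap()
    print(metric.evaluate(["the cat sat on the mat"],
                          ["a cat sat on a mat"]))  # -> [0.666...]
```

Under an interface like this, each of the 20 metrics the abstract mentions, from n-gram overlap through embedding- and LM-based scorers, would be a subclass behind the same `evaluate` signature, which is what makes adding a new metric a matter of a few lines.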

Authors (5)
  1. Marat Saidov (2 papers)
  2. Aleksandra Bakalova (3 papers)
  3. Ekaterina Taktasheva (8 papers)
  4. Vladislav Mikhailov (31 papers)
  5. Ekaterina Artemova (53 papers)
Citations (1)