
MuLER: Detailed and Scalable Reference-based Evaluation (2305.14991v2)

Published 24 May 2023 in cs.CL and cs.AI

Abstract: We propose a novel methodology, MuLER, that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis that can guide targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability in MT evaluation and in other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags, yet they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few do not follow this trend (and their identity changes from language to language). Preliminary experiments with summarization reveal similar trends.
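The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical Python sketch of the underlying idea: mask every token of a given error category (here, a POS tag) in both the hypothesis and the reference with a shared placeholder, so that errors within that category can no longer be penalized, and attribute the resulting score gain to that category. Sentence-level BLEU stands in for the reference-based metric; the function names and masking scheme are illustrative assumptions, not MuLER's published API.

```python
# Hypothetical sketch of per-category penalty attribution, not MuLER's
# actual implementation: mask one POS category on both sides and see how
# much the metric score recovers.
import nltk  # requires: nltk.download("averaged_perceptron_tagger")
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences


def bleu(hyp_tokens, ref_tokens):
    """Sentence-level BLEU as the example reference-based metric."""
    return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smooth)


def mask_category(tokens, category, placeholder="<MASK>"):
    """Replace every token whose POS tag matches `category` with a shared
    placeholder, so errors within that category are no longer penalized."""
    return [placeholder if tag == category else tok
            for tok, tag in nltk.pos_tag(tokens)]


def category_penalty(hyp_tokens, ref_tokens, category):
    """Score gain after masking `category` on both sides; a larger gain
    means the metric penalized this error category more heavily."""
    base = bleu(hyp_tokens, ref_tokens)
    masked = bleu(mask_category(hyp_tokens, category),
                  mask_category(ref_tokens, category))
    return masked - base


# Example: how much of the BLEU penalty is attributable to proper nouns?
hyp = "the cat ate a fish in Paris".split()
ref = "the cat ate a fish in London".split()
print(category_penalty(hyp, ref, "NNP"))
```

Iterating `category_penalty` over all POS tags and averaging across a test set would yield the kind of per-category breakdown the abstract reports for the WMT 2014-2020 submissions.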

