Towards Explainable Evaluation Metrics for Machine Translation (2306.13041v2)

Published 22 Jun 2023 in cs.CL, cs.CY, and cs.LG

Abstract: Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one potential reason being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties and key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT-4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, indirectly, also contribute to better and more transparent machine translation systems.
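The abstract's contrast between transparent lexical metrics and black-box model-based ones can be made concrete with a minimal sketch (not taken from the paper): clipped unigram precision, the basic building block of BLEU. Every matched token is directly inspectable, which is exactly the kind of traceable decision process that neural metrics like COMET or BERTScore lack out of the box.

```python
from collections import Counter

def unigram_precision(hypothesis: str, reference: str) -> float:
    """Clipped unigram precision, the building block of BLEU.

    Each term is inspectable: we can point at exactly which hypothesis
    tokens matched the reference, which is why lexical overlap metrics
    are considered transparent.
    """
    hyp_tokens = hypothesis.lower().split()
    ref_counts = Counter(reference.lower().split())
    if not hyp_tokens:
        return 0.0
    matched = 0
    for token, count in Counter(hyp_tokens).items():
        # Clip each token's credit by how often it occurs in the reference,
        # so repeating a word cannot inflate the score.
        matched += min(count, ref_counts[token])
    return matched / len(hyp_tokens)

hyp = "the cat sat on the mat"
ref = "the cat is on the mat"
print(round(unigram_precision(hyp, ref), 3))  # 5 of 6 tokens match -> 0.833
```

A black-box metric, by contrast, maps the same sentence pair through millions of learned parameters to a single score, with no comparably simple per-token account of why the score came out as it did.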

Authors (6)
  1. Christoph Leiter (13 papers)
  2. Piyawat Lertvittayakumjorn (14 papers)
  3. Marina Fomicheva (11 papers)
  4. Wei Zhao (309 papers)
  5. Yang Gao (761 papers)
  6. Steffen Eger (90 papers)
Citations (10)