Student's t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce (2303.04526v2)

Published 8 Mar 2023 in cs.CL, cs.IT, cs.NA, math.IT, math.NA, and stat.AP

Abstract: In NLP, human judgement is widely relied upon as the gold-standard quality evaluation method. However, there has been an ongoing debate on how best to assess inter-rater reliability (IRR) levels for certain evaluation tasks, such as translation quality evaluation (TQE), especially when the data samples (observations) are very scarce. In this work, we first present a study of how to estimate the confidence interval for a measurement when only one data (evaluation) point is available. This leads to our example with two human-generated observational scores, for which we introduce the Student's t-distribution method and explain how to use it to measure the IRR score from only these two data points, as well as the confidence intervals (CIs) of the quality evaluation. We give a quantitative analysis of how the evaluation confidence can be greatly improved by introducing more observations, even if only one extra observation. We encourage researchers to report their IRR scores by all possible means, e.g. using the Student's t-distribution method whenever possible, thus making NLP evaluation more meaningful, transparent, and trustworthy. This t-distribution method can also be used outside of NLP to measure the IRR level for trustworthy evaluation of experimental investigations whenever the observational data are scarce.

Keywords: Inter-Rater Reliability (IRR); Scarce Observations; Confidence Intervals (CIs); NLP; Translation Quality Evaluation (TQE); Student's t-Distribution
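
The abstract does not reproduce the paper's formulas, but the core calculation it advocates is standard: a two-sided confidence interval for the mean built from Student's t-distribution, which remains valid at very small sample sizes. Below is a minimal Python sketch under that assumption; the two rater scores and the 95% confidence level are hypothetical illustration values, not numbers taken from the paper.

```python
import math
from scipy import stats

def t_confidence_interval(observations, confidence=0.95):
    """Two-sided CI for the mean via Student's t-distribution.

    Works for very small samples (n >= 2), which is the
    scarce-observation setting the paper addresses.
    """
    n = len(observations)
    if n < 2:
        raise ValueError("Need at least two observations for a t-based CI.")
    mean = sum(observations) / n
    # Sample standard deviation with Bessel's correction (n - 1 df).
    sd = math.sqrt(sum((x - mean) ** 2 for x in observations) / (n - 1))
    se = sd / math.sqrt(n)
    # Critical t value for the chosen two-sided confidence level.
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return mean - t_crit * se, mean + t_crit * se

# Hypothetical example: two raters' quality scores for the same translation.
scores = [82.0, 88.0]
low, high = t_confidence_interval(scores)
print(f"mean = {sum(scores) / len(scores):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

With n = 2 (one degree of freedom) the 95% critical value is about 12.71, so the interval above is very wide; a third observation drops the critical value to about 4.30 and narrows the interval sharply, which illustrates the abstract's point that even one extra observation greatly improves the evaluation confidence.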
