Analysis of Systems' Performance in Natural Language Processing Competitions

Published 7 Mar 2024 in cs.LG | (2403.04693v2)

Abstract: Collaborative competitions have gained popularity in the scientific and technological fields. These competitions involve defining tasks, selecting evaluation scores, and devising result verification methods. In the standard scenario, participants receive a training set and are expected to provide a solution for a held-out dataset kept by the organizers. An essential challenge for organizers arises when comparing algorithms' performance, assessing multiple participants, and ranking them. Statistical tools are commonly used for this purpose; however, traditional statistical methods often fail to capture decisive differences between systems' performance. This manuscript describes an evaluation methodology for statistically analyzing competition results and the competitions themselves. The methodology is designed to be universally applicable; however, it is illustrated using eight natural language competitions as case studies involving classification and regression problems. The proposed methodology offers several advantages, including off-the-shelf comparisons with correction mechanisms and the inclusion of confidence intervals. Furthermore, we introduce metrics that allow organizers to assess the difficulty of competitions. Our analysis shows the potential usefulness of our methodology for effectively evaluating competition results.
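
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a percentile bootstrap for the confidence interval of a system's score and the Benjamini-Hochberg procedure as the correction mechanism for multiple pairwise comparisons; the abstract does not state which specific methods the paper uses. The function names (`accuracy`, `bootstrap_ci`, `benjamini_hochberg`) and default parameters are hypothetical and chosen only for illustration.

```python
import random


def accuracy(gold, pred):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def bootstrap_ci(gold, pred, score=accuracy, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a score (assumed method).

    The held-out examples are resampled with replacement n_boot times, the
    score is recomputed on each resample, and the alpha/2 and 1 - alpha/2
    empirical quantiles of those scores are returned.
    """
    rng = random.Random(seed)
    n = len(gold)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(score([gold[i] for i in idx], [pred[i] for i in idx]))
    samples.sort()
    lower = samples[int((alpha / 2) * n_boot)]
    upper = samples[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (assumed correction mechanism).

    Returns one boolean per p-value, in the original order, that is True
    when the corresponding null hypothesis is rejected at FDR level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

In a competition setting, an organizer could call `bootstrap_ci` once per submitted system on the held-out labels, then pass the p-values from all pairwise system comparisons to `benjamini_hochberg` before declaring any difference significant. Any other score function with the same `(gold, pred)` signature can replace `accuracy`.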
