On Orderings of Probability Vectors and Unsupervised Performance Estimation

Published 16 Jun 2023 in cs.LG | (2306.10160v1)

Abstract: Unsupervised performance estimation, or evaluating how well models perform on unlabeled data, is a difficult task. Recently, Garg et al. [2022] proposed a method that performs much better than previous methods. Their method relies on a score function, satisfying certain properties, that maps the probability vectors output by the classifier to the reals, but which score function is best remains an open problem. We explore this problem by first showing that their method fundamentally relies on the ordering induced by this score function. Thus, under monotone transformations of the score function, their method yields the same estimate. Next, we show that in the binary classification setting, nearly all common score functions (the $L^\infty$ norm; the $L^2$ norm; negative entropy; and the $L^2$, $L^1$, and Jensen-Shannon distances to the uniform vector) induce the same ordering over probability vectors. However, this does not hold in higher-dimensional settings. We conduct numerous experiments on well-known NLP data sets and rigorously explore the performance of different score functions. We conclude that the $L^\infty$ norm is the most appropriate.
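The ordering claims in the abstract can be checked numerically on a few hand-picked vectors. The sketch below is illustrative only and is not the authors' code: the helper names (`ordering`, `threshold_estimate`), the example vectors, and the correctness labels are all assumptions, and the thresholding rule is a simplified stand-in for a score-threshold estimator, not necessarily the exact procedure of Garg et al. [2022].

```python
import math

# Four of the score functions named in the abstract (larger = more confident).
def linf_norm(p):
    return max(p)

def l2_norm(p):
    return math.sqrt(sum(x * x for x in p))

def neg_entropy(p):
    return sum(x * math.log(x) for x in p if x > 0)

def l1_to_uniform(p):
    k = len(p)
    return sum(abs(x - 1.0 / k) for x in p)

SCORES = {
    "linf": linf_norm,
    "l2": l2_norm,
    "neg_entropy": neg_entropy,
    "l1_to_uniform": l1_to_uniform,
}

def ordering(score, vectors):
    """Indices of `vectors` sorted from lowest to highest score."""
    return sorted(range(len(vectors)), key=lambda i: score(vectors[i]))

def threshold_estimate(score, source, source_correct, target):
    """Hypothetical estimator: pick a threshold t so that the number of
    source scores below t equals the number of source errors, then report
    the fraction of target scores at or above t."""
    s = sorted(score(p) for p in source)
    n_errors = sum(1 for c in source_correct if not c)
    t = s[n_errors] if n_errors < len(s) else float("inf")
    return sum(1 for p in target if score(p) >= t) / len(target)

# Binary setting: every score function induces the same ordering.
binary = [(q, 1 - q) for q in (0.5, 0.9, 0.35, 0.72, 0.05, 0.6)]
binary_orders = {name: ordering(f, binary) for name, f in SCORES.items()}

# Made-up correctness labels (two errors) and unlabeled target vectors.
source_correct = [False, True, False, True, True, True]
target = [(0.8, 0.2), (0.55, 0.45), (0.3, 0.7), (0.62, 0.38)]
estimates = {
    name: threshold_estimate(f, binary, source_correct, target)
    for name, f in SCORES.items()
}

# A strictly increasing transformation of a score leaves the estimate unchanged.
def cubed_linf(p):
    return linf_norm(p) ** 3

cubed_estimate = threshold_estimate(cubed_linf, binary, source_correct, target)

# Three classes: the L-infinity and L2 norms can disagree on which vector ranks higher.
p3 = (0.5, 0.5, 0.0)
q3 = (0.6, 0.2, 0.2)
disagree = linf_norm(p3) < linf_norm(q3) and l2_norm(p3) > l2_norm(q3)
```

On these inputs every entry of `binary_orders` is the same list, so all four score functions yield the same value in `estimates`, and `cubed_estimate` matches them. The three-class pair shows why this breaks down with more classes: `q3` has the larger maximum probability, while `p3` has the larger $L^2$ norm.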

References (17)
  1. Leveraging unlabeled data to predict out-of-distribution performance. In International Conference on Learning Representations (ICLR), 2022.
  2. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010a.
  3. Impossibility theorems for domain adaptation. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 129–136, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010b. PMLR. URL https://proceedings.mlr.press/v9/david10a.html.
  4. Predicting with confidence on unseen distributions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1134–1144, 2021.
  5. Are labels always necessary for classifier accuracy evaluation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15069–15078, 2021.
  6. What does rotation prediction tell us about classifier accuracy under varying testing environments? In International Conference on Machine Learning, pages 2579–2589. PMLR, 2021.
  7. Assessing generalization of SGD via disagreement. In International Conference on Learning Representations, 2022. URL https://iclr.cc/virtual/2022/spotlight/6301.
  8. Mandoline: Model evaluation under distribution shift. In International Conference on Machine Learning, pages 1617–1629. PMLR, 2021.
  9. Estimating accuracy from unlabeled data: A bayesian approach. In International Conference on Machine Learning, pages 1416–1425. PMLR, 2016.
  10. Neural unsupervised domain adaptation in nlp—a survey. arXiv preprint arXiv:2006.00632, 2020.
  11. Stephen A Rhoades. The herfindahl-hirschman index. Fed. Res. Bull., 79:188, 1993.
  12. Edward H Simpson. Measurement of diversity. Nature, 163(4148):688–688, 1949.
  13. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1404. URL https://www.aclweb.org/anthology/D18-1404.
  14. Semeval 2018 task 2: Multilingual emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 24–33, 2018.
  15. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020, mar 2020. URL https://arxiv.org/abs/2003.04807. Data available at https://github.com/PolyAI-LDN/task-specific-datasets.
  16. Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
  17. Jacob Cohen. Statistical power analysis for the behavioral sciences. Routledge, 2013.
