Binary Quantification and Dataset Shift: An Experimental Investigation (2310.04565v1)

Published 6 Oct 2023 in cs.LG and cs.AI

Abstract: Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data, and is of special interest when the labelled data on which the predictor has been trained and the unlabelled data are not IID, i.e., suffer from dataset shift. To date, quantification methods have mostly been tested only on a special case of dataset shift, i.e., prior probability shift; the relationship between quantification and other types of dataset shift remains, by and large, unexplored. In this work we carry out an experimental analysis of how current quantification algorithms behave under different types of dataset shift, in order to identify limitations of current approaches and hopefully pave the way for the development of more broadly applicable methods. We do this by proposing a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift, and by testing existing quantification methods on the datasets thus generated. One finding that results from this investigation is that many existing quantification methods that had been found robust to prior probability shift are not necessarily robust to other types of dataset shift. A second finding is that no existing quantification method seems robust enough to deal with all the types of dataset shift we simulate in our experiments. The code needed to reproduce all our experiments is publicly available at https://github.com/pglez82/quant_datasetshift.
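
To make the task concrete, below is a minimal, self-contained sketch of two standard quantification baselines from the literature, Classify & Count (CC) and Adjusted Classify & Count (ACC), evaluated under simulated prior probability shift. This is an illustration only, not the authors' experimental code (which is available at the repository linked above); the helper names `cc` and `acc` and the synthetic-data setup are our own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

def cc(clf, X):
    # Classify & Count: estimated prevalence = fraction predicted positive
    return clf.predict(X).mean()

def acc(clf, X_tr, y_tr, X_te):
    # Adjusted Classify & Count: correct the CC estimate using the
    # classifier's tpr/fpr, estimated on training data via cross-validation
    preds = cross_val_predict(clf, X_tr, y_tr, cv=10)
    tpr = preds[y_tr == 1].mean()  # P(pred=1 | y=1)
    fpr = preds[y_tr == 0].mean()  # P(pred=1 | y=0)
    clf.fit(X_tr, y_tr)
    raw = cc(clf, X_te)
    if tpr == fpr:  # degenerate classifier: fall back to the raw CC estimate
        return raw
    return float(np.clip((raw - fpr) / (tpr - fpr), 0.0, 1.0))

# Simulate prior probability shift: train at ~50% positives, test at ~10%
X, y = make_classification(n_samples=5000, weights=[0.5], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
pos, neg = X_te[y_te == 1], X_te[y_te == 0]
X_shift = np.vstack([pos[: len(neg) // 9], neg])  # ~10% positives
y_shift = np.r_[np.ones(len(neg) // 9), np.zeros(len(neg))]

clf = LogisticRegression(max_iter=1000)
print("true prevalence:", y_shift.mean())
print("ACC estimate:   ", acc(clf, X_tr, y_tr, X_shift))
```

ACC's adjustment divides out the classifier's error rates, which makes it consistent under prior probability shift, where p(x|y) is unchanged between training and test data. Under other shift types, the tpr and fpr estimated on training data no longer describe the classifier's test behaviour, which is one reason methods found robust to prior probability shift can fail elsewhere, as the paper's experiments show.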
