
Kernel Density Estimation for Multiclass Quantification (2401.00490v2)

Published 31 Dec 2023 in cs.LG and stat.ML

Abstract: Several disciplines, such as the social sciences, epidemiology, sentiment analysis, or market research, are interested in knowing the distribution of the classes in a population rather than the individual labels of its members. Quantification is the supervised machine learning task concerned with obtaining accurate predictors of class prevalence, particularly in the presence of label shift. The distribution-matching (DM) approaches represent one of the most important families among the quantification methods proposed in the literature so far. Current DM approaches model the involved populations by means of histograms of posterior probabilities. In this paper, we argue that their application to the multiclass setting is suboptimal, since the histograms become class-specific and thus miss the opportunity to model inter-class information that may exist in the data. We propose a new representation mechanism based on multivariate densities that we model via kernel density estimation (KDE). The experiments we have carried out show that our method, dubbed KDEy, yields superior quantification performance with respect to previous DM approaches. We also investigate the KDE-based representation within the maximum likelihood framework and show that KDEy often outperforms the expectation-maximization method for quantification, arguably the strongest contender in the quantification arena to date.
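
The abstract outlines the core idea: replace the class-specific histograms of classifier posteriors used by distribution-matching methods with class-conditional multivariate KDEs, and recover the class prevalences of an unlabeled test sample either by divergence minimization or by maximum likelihood. The sketch below illustrates the maximum-likelihood reading of that idea; it is not the authors' KDEy implementation, and the logistic-regression classifier, the scikit-learn KernelDensity bandwidth, and the SLSQP-style simplex optimization are assumptions made purely for illustration.

```python
# Illustrative sketch of KDE-based mixture quantification (NOT the authors' KDEy code).
# Assumptions: posteriors from a logistic-regression classifier, a fixed Gaussian-KDE
# bandwidth, and maximum-likelihood estimation of the mixture weights on the simplex.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KernelDensity


def fit_class_kdes(X_train, y_train, bandwidth=0.1):
    """Fit one multivariate KDE per class on the classifier's posterior vectors."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    posteriors = clf.predict_proba(X_train)
    classes = np.unique(y_train)
    kdes = [KernelDensity(bandwidth=bandwidth).fit(posteriors[y_train == c])
            for c in classes]
    return clf, kdes, classes


def quantify(X_test, clf, kdes, classes):
    """Estimate class prevalences alpha by maximizing the test-set log-likelihood
    of the mixture sum_i alpha_i * p_i(posterior) over the probability simplex."""
    P = clf.predict_proba(X_test)
    # (n_test, n_classes) matrix of class-conditional densities at each test posterior
    dens = np.exp(np.column_stack([kde.score_samples(P) for kde in kdes]))

    def neg_log_likelihood(alpha):
        mix = dens @ alpha
        return -np.log(np.clip(mix, 1e-12, None)).sum()

    n = len(classes)
    alpha0 = np.full(n, 1.0 / n)                     # start from the uniform prevalence
    constraints = {"type": "eq", "fun": lambda a: a.sum() - 1.0}
    bounds = [(0.0, 1.0)] * n
    res = minimize(neg_log_likelihood, alpha0, bounds=bounds, constraints=constraints)
    return res.x                                     # estimated prevalence vector
```

In this reading, `quantify` returns the mixture weights that make the class-conditional KDEs best explain the unlabeled test posteriors, which is the prevalence estimate; the divergence-minimization variant mentioned in the abstract would instead compare the mixture density against a KDE fitted on the test posteriors.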

