On Ranking-based Tests of Independence (2403.07464v1)

Published 12 Mar 2024 in math.ST, stat.ME, stat.ML, and stat.TH

Abstract: In this paper we develop a novel nonparametric framework to test the independence of two random variables $\mathbf{X}$ and $\mathbf{Y}$ with unknown respective marginals $H(dx)$ and $G(dy)$ and joint distribution $F(dx\, dy)$, based on Receiver Operating Characteristic (ROC) analysis and bipartite ranking. The rationale behind our approach relies on the fact that the independence hypothesis $\mathcal{H}_0$ is necessarily false as soon as the optimal scoring function related to the pair of distributions $(H\otimes G,\; F)$, obtained from a bipartite ranking algorithm, has a ROC curve that deviates from the main diagonal of the unit square. We consider a wide class of rank statistics, encompassing many ways of deviating from the diagonal in the ROC space, to build tests of independence. Beyond its great flexibility, this new method has theoretical properties that far surpass those of its competitors. Nonasymptotic bounds for the two types of testing errors are established. From an empirical perspective, the novel procedure we promote in this paper exhibits a remarkable ability to detect small departures, of various types, from the null assumption $\mathcal{H}_0$, even in high dimension, as supported by the numerical experiments presented here.
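
To make the testing principle in the abstract concrete, the following is a minimal sketch of how such a ranking-based independence test could be assembled in Python. It is an illustration under assumptions of our own, not the authors' exact procedure: the product distribution $H\otimes G$ is approximated by permuting the $\mathbf{Y}$ sample, an off-the-shelf gradient-boosted classifier stands in for the bipartite ranking algorithm, and the AUC is used as the simplest possible measure of deviation of the ROC curve from the diagonal.

    # Sketch only: the permutation device and the choice of scoring algorithm
    # are illustrative assumptions, not the paper's prescription.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # Simulated paired observations (X_i, Y_i); replace with real data.
    n = 2000
    X = rng.normal(size=(n, 3))
    Y = 0.3 * X[:, :1] + rng.normal(size=(n, 1))   # weak dependence to detect

    # Split the sample: one half learns the scoring function, the other half
    # evaluates its ROC curve so the statistic is not optimistically biased.
    idx = rng.permutation(n)
    train, test = idx[: n // 2], idx[n // 2:]

    def make_pairs(X, Y, rng):
        # Pairs drawn from the joint F get label 1; pairs from the product of
        # marginals H (x) G (approximated by permuting Y) get label 0.
        pos = np.hstack([X, Y])
        neg = np.hstack([X, Y[rng.permutation(len(Y))]])
        Z = np.vstack([pos, neg])
        labels = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
        return Z, labels

    Z_tr, l_tr = make_pairs(X[train], Y[train], rng)
    Z_te, l_te = make_pairs(X[test], Y[test], rng)

    # Any bipartite ranking / scoring algorithm can be plugged in here.
    scorer = GradientBoostingClassifier().fit(Z_tr, l_tr)
    scores = scorer.predict_proba(Z_te)[:, 1]

    # Under H_0 the ROC curve of any scoring function stays close to the
    # diagonal, so the AUC should be near 1/2; a large deviation is evidence
    # against independence.
    auc = roc_auc_score(l_te, scores)
    print(f"test AUC = {auc:.3f}  (independence predicts about 0.5)")

A real test would additionally calibrate a rejection threshold for the chosen deviation statistic, for instance by recomputing it on data where the $\mathbf{Y}$ sample has been independently permuted; the paper's nonasymptotic error bounds cover a much broader class of rank statistics than the AUC used above.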

