MMD Two-sample Testing in the Presence of Arbitrarily Missing Data (2405.15531v1)

Published 24 May 2024 in stat.ME

Abstract: In many real-world applications, it is common that a proportion of the data may be missing or only partially observed. We develop a novel two-sample testing method based on the Maximum Mean Discrepancy (MMD) which accounts for missing data in both samples, without making assumptions about the missingness mechanism. Our approach is based on deriving the mathematically precise bounds of the MMD test statistic after accounting for all possible missing values. To the best of our knowledge, it is the only two-sample testing method that is guaranteed to control the Type I error for both univariate and multivariate data where data may be arbitrarily missing. Simulation results show that our method has good statistical power, typically for cases where 5% to 10% of the data are missing. We highlight the value of our approach when the data are missing not at random, a context in which either ignoring the missing values or using common imputation methods may not control the Type I error.
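For orientation, the complete-data MMD two-sample test that the abstract builds on can be sketched as follows. This is a minimal illustration, not the authors' method: it does not handle missing values or compute the bounds described above. The Gaussian kernel, the fixed bandwidth, the biased (V-statistic) estimator, the permutation-based p-value, and all function names are assumptions chosen for the example.

```python
import numpy as np


def _as_2d(x):
    """Return x as an (n, d) float array; a 1-D input is treated as univariate."""
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    a, b = _as_2d(a), _as_2d(b)
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()


def permutation_pvalue(x, y, n_perm=200, bandwidth=1.0, seed=0):
    """Permutation p-value for H0: both samples come from the same distribution."""
    rng = np.random.default_rng(seed)
    x, y = _as_2d(x), _as_2d(y)
    pooled = np.vstack([x, y])
    n = len(x)
    observed = mmd2_biased(x, y, bandwidth)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        stat = mmd2_biased(pooled[idx[:n]], pooled[idx[n:]], bandwidth)
        exceed += stat >= observed
    # The +1 terms count the observed statistic as one of the permutations.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical fully observed data: two samples that differ by a mean shift.
demo_rng = np.random.default_rng(42)
sample_x = demo_rng.normal(loc=0.0, size=(50, 2))
sample_y = demo_rng.normal(loc=1.0, size=(50, 2))
print("MMD^2 estimate:", mmd2_biased(sample_x, sample_y))
print("permutation p-value:", permutation_pvalue(sample_x, sample_y))
```

This sketch assumes every observation is fully observed. When values are missing, the statistic cannot be evaluated directly, which is the gap the paper addresses by bounding the test statistic over all possible values of the missing entries.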

