
Perturbation-based Effect Measures for Compositional Data (2311.18501v5)

Published 30 Nov 2023 in stat.ME, math.ST, stat.ML, and stat.TH

Abstract: Existing effect measures for compositional features are inadequate for many modern applications, for example in microbiome research, where the data are high-dimensional and sparse and are therefore poorly modelled by traditional parametric approaches. Further, assessing, in an unbiased way, how summary statistics of a composition (e.g., racial diversity) affect a response variable is not straightforward. We propose a framework based on hypothetical data perturbations that defines interpretable statistical functionals on the compositions themselves, which we call average perturbation effects. These effects naturally account for confounding that biases frequently used marginal dependence analyses. We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization and applying semiparametric estimation techniques. We analyze the proposed estimators empirically on simulated and semi-synthetic data and demonstrate advantages over existing techniques on data from New York schools and microbiome data.
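
To make the idea of a hypothetical perturbation of a composition concrete, the following is a minimal sketch and not the estimator proposed in the paper. Everything in it is an assumption chosen for illustration: the Gini-Simpson index as the diversity summary, a perturbation that mixes each composition with the uniform composition, a toy response surface `f` that is treated as known, and all function names. It only shows a naive finite-difference average of how the response changes under the perturbation; the semiparametric estimation described in the abstract is not reproduced here.

```python
import numpy as np

def closure(x):
    """Rescale non-negative rows so each sums to 1 (a point on the simplex)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def gini_simpson(z):
    """Gini-Simpson diversity: 1 - sum_j z_j^2 (0 when one category has all mass)."""
    z = np.asarray(z, dtype=float)
    return 1.0 - np.sum(z ** 2, axis=-1)

def perturb_toward_uniform(z, eps):
    """Hypothetical perturbation: mix each composition with the uniform one."""
    z = np.asarray(z, dtype=float)
    d = z.shape[-1]
    return (1.0 - eps) * z + eps / d

def naive_perturbation_effect(f, z, eps=1e-3):
    """Average finite-difference change in f(z) under the perturbation."""
    return float(np.mean((f(perturb_toward_uniform(z, eps)) - f(z)) / eps))

# Toy data: 200 compositions over 4 categories, built from random counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(200, 4))
z = closure(counts)

def f(comp):
    """Toy response surface (assumed known here): response grows with diversity."""
    return 2.0 * gini_simpson(comp)

effect = naive_perturbation_effect(f, z)
```

In this toy setup the effect is positive, because moving a composition toward the uniform one can only increase the Gini-Simpson index. In practice the response surface is unknown and confounders are present, which is where the estimation techniques mentioned in the abstract come in.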
