A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets (2402.03985v3)

Published 6 Feb 2024 in cs.LG and stat.ML

Abstract: Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but the theoretical understanding of them is currently very light. We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory yields a simple rule of thumb to select the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice with several real datasets, downstream predictors and error metrics. As our theory predicts, multiple synthetic datasets often improve accuracy, while a single large synthetic dataset gives at best minimal improvement, showing that our insights are practically relevant.
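To make the setting concrete, the kind of ensemble the decomposition concerns can be sketched as follows: generate several synthetic datasets, train one downstream predictor per dataset, and average their predictions. The snippet below is an illustrative sketch only, not the paper's implementation; `make_synthetic` is a hypothetical stand-in for a real (possibly differentially private) generator, and Ridge regression is just one example of a downstream predictor.

```python
# Minimal sketch: ensemble of predictors, one per synthetic dataset.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_synthetic(X, y, n_rows, rng):
    # Hypothetical stand-in generator: bootstrap the real data and add noise.
    # A real application would use a generative model (e.g. a differentially
    # private one) instead.
    idx = rng.integers(0, len(X), size=n_rows)
    X_syn = X[idx] + rng.normal(scale=0.1, size=(n_rows, X.shape[1]))
    y_syn = y[idx] + rng.normal(scale=0.1, size=n_rows)
    return X_syn, y_syn

# Toy regression data standing in for the real dataset.
w = rng.normal(size=5)
X = rng.normal(size=(200, 5))
y = X @ w + rng.normal(scale=0.5, size=200)
X_test = rng.normal(size=(100, 5))
y_test = X_test @ w + rng.normal(scale=0.5, size=100)

m = 10  # number of synthetic datasets; the paper's rule of thumb guides this choice
preds = []
for _ in range(m):
    X_syn, y_syn = make_synthetic(X, y, n_rows=len(X), rng=rng)
    model = Ridge().fit(X_syn, y_syn)   # downstream predictor for one synthetic dataset
    preds.append(model.predict(X_test))

ensemble_pred = np.mean(preds, axis=0)  # averaging targets the variance term of the MSE
print("single-model MSE:", mean_squared_error(y_test, preds[0]))
print("ensemble MSE:   ", mean_squared_error(y_test, ensemble_pred))
```

For classification with the Brier score, the analogous sketch would average predicted class probabilities across the per-dataset models rather than point predictions.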
