Trained Random Forests Completely Reveal your Dataset (2402.19232v2)

Published 29 Feb 2024 in cs.LG and cs.CR

Abstract: We introduce an optimization-based reconstruction attack capable of completely or near-completely reconstructing a dataset utilized for training a random forest. Notably, our approach relies solely on information readily available in commonly used libraries such as scikit-learn. To achieve this, we formulate the reconstruction problem as a combinatorial problem under a maximum likelihood objective. We demonstrate that this problem is NP-hard, though solvable at scale using constraint programming -- an approach rooted in constraint propagation and solution-domain reduction. Through an extensive computational investigation, we demonstrate that random forests trained without bootstrap aggregation but with feature randomization are susceptible to a complete reconstruction. This holds true even with a small number of trees. Even with bootstrap aggregation, the majority of the data can also be reconstructed. These findings underscore a critical vulnerability inherent in widely adopted ensemble methods, warranting attention and mitigation. Although the potential for such reconstruction attacks has been discussed in privacy research, our study provides clear empirical evidence of their practicability.

Summary

  • The paper demonstrates that an optimization-based attack can nearly reconstruct a dataset from trained random forests, revealing a critical privacy vulnerability.
  • It formulates the reconstruction as an NP-hard combinatorial problem and solves it in practice using advanced constraint programming techniques.
  • Empirical results on real-world datasets show that even small tree ensembles trained without bagging can expose almost the entire training set.

Overview of the Paper "Trained Random Forests Completely Reveal your Dataset"

This paper presents a detailed examination of the vulnerability of Random Forests (RFs) with respect to the privacy of their training data. The authors propose an optimization-based attack that can nearly reconstruct the dataset used to train an RF, relying only on information accessible from common libraries such as scikit-learn. The attack is formulated as a combinatorial optimization problem under a maximum likelihood objective; the problem is proven NP-hard, yet solvable in practice using constraint programming.
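
To make the notion of "readily available information" concrete, the following minimal sketch (illustrative only, with a random toy dataset standing in for real training data) shows what a fitted scikit-learn forest exposes for each tree: split features, split thresholds, leaf sizes, and the per-class statistics stored at each leaf. The attack builds on exactly this kind of white-box access; the snippet is not the authors' code.

```python
# Illustrative only: inspect the structure scikit-learn stores for each fitted tree.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 8))    # toy binary features (placeholder data)
y = rng.integers(0, 2, size=50)

# bootstrap=False mirrors the "no bagging" setting studied in the paper
rf = RandomForestClassifier(n_estimators=3, max_depth=3, bootstrap=False, random_state=0)
rf.fit(X, y)

for i, est in enumerate(rf.estimators_):
    t = est.tree_
    for node in range(t.node_count):
        if t.children_left[node] == -1:  # leaf node
            # t.value holds the per-class statistics at the leaf (counts or
            # fractions, depending on the scikit-learn version); n_node_samples
            # gives the number of training examples routed to this leaf.
            print(f"tree {i} leaf {node}: {t.n_node_samples[node]} samples, "
                  f"class stats {t.value[node][0]}")
        else:                            # internal split
            print(f"tree {i} node {node}: feature {t.feature[node]} "
                  f"<= {t.threshold[node]:.2f}")
```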

Key Contributions

  1. Reconstruction Formulation: The paper casts dataset reconstruction from a trained RF as a combinatorial problem under a maximum likelihood objective and proves this problem to be NP-hard. The attack uses only information that standard libraries expose about the trained model, such as the split structure of each tree and the statistics stored at its leaves.
  2. Vulnerability Detection: RFs trained without bootstrap aggregation are shown to be especially exposed: even a small number of trees suffices for a near-complete reconstruction of the training set. RFs trained with bootstrap aggregation, sometimes regarded as providing some intrinsic privacy, still allow the majority of the data to be reconstructed.
  3. Constraint Programming Model: The authors solve the reconstruction problem with a Constraint Programming (CP) model, relying on modern CP solvers that combine constraint propagation with solution-domain reduction (a toy sketch of this idea follows the list).
  4. Empirical Validation: Extensive computational experiments show that RFs trained without privacy safeguards expose large portions of their training data. These findings are demonstrated on popular datasets, illustrating the practical, real-world risk of such exposure.
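
To convey the flavor of the CP formulation, the toy sketch below uses OR-Tools CP-SAT on a deliberately tiny, hypothetical instance: Boolean variables encode the unknown binary feature values of the training examples, and a counting constraint forces the reconstructed examples to populate a tree's leaf in a number consistent with the trained model. The single-split tree, the leaf count, and all variable names are illustrative simplifications, not the paper's actual model.

```python
# Toy sketch of the reconstruction-as-CP idea (not the paper's model).
from ortools.sat.python import cp_model

n_examples, n_features = 4, 3
split_feature = 0       # hypothetical root split: an example goes right iff feature 0 == 1
right_leaf_count = 3    # leaf size observed in the (hypothetical) trained tree

model = cp_model.CpModel()
# x[k][j] = reconstructed value of binary feature j for training example k
x = [[model.NewBoolVar(f"x_{k}_{j}") for j in range(n_features)]
     for k in range(n_examples)]

# Leaf-cardinality constraint: exactly `right_leaf_count` examples reach the right leaf.
model.Add(sum(x[k][split_feature] for k in range(n_examples)) == right_leaf_count)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([[solver.Value(x[k][j]) for j in range(n_features)]
           for k in range(n_examples)])
```

In the full attack, constraints of this kind are accumulated over every node of every tree in the forest, and the maximum likelihood objective selects among the reconstructions that remain feasible.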

Experimental Evaluation

The paper evaluates the proposed method on several well-known datasets: COMPAS, UCI Adult Income, and Default of Credit Card Clients. These datasets arise in sensitive real-world contexts (criminal records, income data, credit history), which underscores the importance of safeguarding data privacy. The experiments consider settings with and without bagging and observe high rates of dataset reconstruction in many configurations, confirming the exposure risk.
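
The two training regimes compared in the experiments correspond to a single scikit-learn flag; the sketch below is a plausible setup, with the dataset loading left as a placeholder rather than the authors' preprocessing.

```python
# Hypothetical setup for the two settings compared in the experiments.
from sklearn.ensemble import RandomForestClassifier

# X, y = ...  # e.g. a binarized version of COMPAS, Adult, or Default of Credit Card Clients

rf_no_bagging = RandomForestClassifier(n_estimators=10, bootstrap=False)  # no bootstrap aggregation
rf_bagging = RandomForestClassifier(n_estimators=10, bootstrap=True)      # with bagging (sklearn default)
```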

The experiments yield several critical insights:

  • Without bagging, the attack recovers the datasets in their entirety, even from forests with very few trees.
  • Bagging provides some protection by reducing the completeness of the reconstruction, yet a substantial majority of the data can still be recovered.
  • Increasing the number of trees and their depth consistently lowers the reconstruction error, underlining how much RFs reveal in the absence of privacy-preserving mechanisms (one possible error metric is sketched below).
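
One natural way to quantify reconstruction error is to pair reconstructed rows with original rows via an optimal matching and average their per-feature disagreement. The exact metric reported in the paper may differ, so the function below, built on SciPy's assignment solver, is only a hedged illustration.

```python
# Assumed, illustrative reconstruction-error metric (not necessarily the paper's).
import numpy as np
from scipy.optimize import linear_sum_assignment

def reconstruction_error(X_true, X_rec):
    """Mean per-feature mismatch under the best one-to-one pairing of rows."""
    # cost[i, j] = fraction of features on which true row i and reconstructed row j differ
    cost = (X_true[:, None, :] != X_rec[None, :, :]).mean(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

X_true = np.array([[0, 1, 1], [1, 0, 0], [1, 1, 0]])
X_rec  = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
print(reconstruction_error(X_true, X_rec))   # 1/9 ≈ 0.111
```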

Implications

The findings carry significant implications for the development and deployment of ML models. The fact that trained RFs can leak their training data underscores the need to integrate robust privacy-preserving mechanisms, such as differential privacy, before model deployment. As ML continues to proliferate in sensitive environments, safeguarding models against inference and reconstruction attacks becomes paramount.

Furthermore, the work points to an evolving landscape of adversarial capabilities against ML systems, making the continued development and application of privacy mechanisms vital for future ML research and applications. It raises awareness of the inherent vulnerabilities of certain ML algorithms, urging a reevaluation of data privacy protocols across widely used toolsets.

Future Work

This research opens several avenues for further exploration:

  • Extending the attack methods to other commonly used ensembles and ML models such as Gradient Boosting Machines or neural networks.
  • Analyzing how well advanced privacy-preserving techniques withstand reconstruction attacks as computational capabilities evolve.
  • Integrating the work's methods into broader frameworks for evaluating and certifying data privacy compliance in ML systems, potentially influencing regulatory standards for data protection.

In conclusion, the paper provides a foundational insight into the critical privacy concerns surrounding RFs. It serves as a call to re-engineer privacy-preserving methodologies, ensuring that ML models are not only powerful but also ethically and securely developed.
