Enhancing Data Provenance and Model Transparency in Federated Learning Systems -- A Database Approach (2403.01451v1)
Abstract: Federated Learning (FL) is a promising paradigm for training machine learning models across decentralized edge devices while preserving data privacy. Ensuring the integrity and traceability of data across these distributed environments, however, remains a critical challenge. Transparent artificial intelligence, such as the ability to detail how a machine learning model was trained, has become an increasingly prominent concern because training involves a large number of sensitive (hyper)parameters; it is therefore imperative to strike a reasonable balance between openness and the need to protect sensitive information. In this paper, we propose one of the first approaches to enhance data provenance and model transparency in federated learning systems. Our methodology combines cryptographic techniques with efficient model management to track the transformation of data throughout the FL process, and seeks to increase the reproducibility and trustworthiness of a trained FL model. We demonstrate the effectiveness of our approach through experimental evaluations on diverse FL scenarios, showing that it can address both accountability and explainability. Our findings show that the system can greatly enhance data transparency in various FL environments by storing chained cryptographic hashes and client model snapshots in our proposed design for data-decoupled FL. Multiple optimization techniques make this possible, enabling comprehensive data provenance without imposing substantial computational load. Extensive experimental results suggest that integrating a database subsystem into federated learning systems can improve data provenance efficiently, encouraging secure FL adoption in privacy-sensitive applications and paving the way for future advances in FL transparency and security features.
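To make the idea of "chained cryptographic hashes and client model snapshots" concrete, the following is a minimal sketch and not the authors' implementation. It assumes SHA-256 as the hash function, JSON-serializable model weights, and an in-memory list standing in for the database subsystem; the names `GENESIS`, `snapshot_digest`, `append_record`, and `verify_chain` are hypothetical and chosen only for illustration.

```python
import hashlib
import json

# Hypothetical starting value for the chain (an assumption, not from the paper).
GENESIS = "0" * 64


def snapshot_digest(weights):
    """Hash a client model snapshot; weights are assumed JSON-serializable."""
    payload = json.dumps(weights, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_record(chain, client_id, round_no, weights):
    """Append a provenance record that commits to the previous record's hash."""
    prev = chain[-1]["chain_hash"] if chain else GENESIS
    digest = snapshot_digest(weights)
    # Serialize the linked fields unambiguously before hashing them together.
    body = json.dumps([prev, client_id, round_no, digest]).encode("utf-8")
    record = {
        "client_id": client_id,
        "round": round_no,
        "snapshot_hash": digest,
        "prev_hash": prev,
        "chain_hash": hashlib.sha256(body).hexdigest(),
    }
    chain.append(record)
    return record


def verify_chain(chain, snapshots):
    """Recompute every link; return False if any snapshot or link was altered."""
    if len(chain) != len(snapshots):
        return False
    prev = GENESIS
    for record, weights in zip(chain, snapshots):
        digest = snapshot_digest(weights)
        body = json.dumps(
            [prev, record["client_id"], record["round"], digest]
        ).encode("utf-8")
        expected = hashlib.sha256(body).hexdigest()
        if record["prev_hash"] != prev or record["chain_hash"] != expected:
            return False
        prev = record["chain_hash"]
    return True
```

Because each record commits to the hash of its predecessor, changing any stored snapshot changes its digest and invalidates every later link, which is the property that lets an auditor check a training history without re-running it. A real deployment would persist the records and snapshots in a database and would likely serialize tensors in a binary format rather than JSON.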
Authors: Michael Gu, Ramasoumya Naraparaju, Dongfang Zhao