
Enhancing Data Provenance and Model Transparency in Federated Learning Systems -- A Database Approach (2403.01451v1)

Published 3 Mar 2024 in cs.CR, cs.DB, and cs.LG

Abstract: Federated Learning (FL) presents a promising paradigm for training machine learning models across decentralized edge devices while preserving data privacy. Ensuring the integrity and traceability of data across these distributed environments, however, remains a critical challenge. The ability to create transparent artificial intelligence, such as detailing the training process of a machine learning model, has become an increasingly prominent concern due to the large number of sensitive (hyper)parameters it utilizes; thus, it is imperative to strike a reasonable balance between openness and the need to protect sensitive information. In this paper, we propose one of the first approaches to enhance data provenance and model transparency in federated learning systems. Our methodology leverages a combination of cryptographic techniques and efficient model management to track the transformation of data throughout the FL process, and seeks to increase the reproducibility and trustworthiness of a trained FL model. We demonstrate the effectiveness of our approach through experimental evaluations on diverse FL scenarios, showcasing its ability to address accountability and explainability. Our findings show that our system can greatly enhance data transparency in various FL environments by storing chained cryptographic hashes and client model snapshots in our proposed design for data-decoupled FL. This is made possible by multiple optimization techniques that enable comprehensive data provenance without imposing substantial computational loads. Extensive experimental results suggest that integrating a database subsystem into federated learning systems can improve data provenance efficiently, encouraging secure FL adoption in privacy-sensitive applications and paving the way for future advancements in FL transparency and security features.
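To make the idea of "chained cryptographic hashes and client model snapshots" concrete, the following is a minimal sketch and not the authors' implementation. It assumes SHA-256 as the hash function, JSON as the snapshot serialization, and an all-zero genesis value; the function names `snapshot_digest`, `build_chain`, and `verify_chain` are hypothetical and do not come from the paper.

```python
import hashlib
import json


def snapshot_digest(weights, prev_hash):
    """Hash one model snapshot together with the previous link's hash."""
    # sort_keys gives a stable serialization, so equal snapshots hash equally.
    payload = json.dumps(weights, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def build_chain(snapshots, genesis="0" * 64):
    """Return one provenance record per training round, each linked to the last."""
    chain = []
    prev = genesis
    for round_id, weights in enumerate(snapshots):
        digest = snapshot_digest(weights, prev)
        chain.append({"round": round_id, "prev_hash": prev, "hash": digest})
        prev = digest
    return chain


def verify_chain(snapshots, chain, genesis="0" * 64):
    """Recompute every link; any altered snapshot or record breaks the chain."""
    if len(snapshots) != len(chain):
        return False
    prev = genesis
    for weights, record in zip(snapshots, chain):
        if record["prev_hash"] != prev:
            return False
        if record["hash"] != snapshot_digest(weights, prev):
            return False
        prev = record["hash"]
    return True


if __name__ == "__main__":
    # Toy snapshots standing in for one client's weights over three rounds.
    snapshots = [{"w": [0.1, 0.2]}, {"w": [0.15, 0.25]}, {"w": [0.2, 0.3]}]
    chain = build_chain(snapshots)
    print(verify_chain(snapshots, chain))

    tampered = list(snapshots)
    tampered[1] = {"w": [9.9, 0.25]}
    print(verify_chain(tampered, chain))
```

Because each record's hash covers the previous record's hash, changing a stored snapshot from an earlier round invalidates that record and every later one, which is the property that makes such a chain useful for tracing how a model evolved.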

Authors (3)
  1. Michael Gu (2 papers)
  2. Ramasoumya Naraparaju (1 paper)
  3. Dongfang Zhao (56 papers)