SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training Architecture (2309.14148v1)
Abstract: The advent of serverless computing has ushered in notable advancements in distributed machine learning, particularly within parameter server-based architectures. Yet, the integration of serverless features within peer-to-peer (P2P) distributed networks remains largely uncharted. In this paper, we introduce SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture designed to bridge this gap. Capitalizing on the robustness and reliability inherent to P2P systems, SPIRT employs RedisAI for in-database operations, leading to an 82% reduction in the time required for model updates and gradient averaging across a variety of models and batch sizes. The architecture is resilient to peer failures and gracefully integrates new peers, demonstrating both fault tolerance and scalability. Furthermore, SPIRT secures communication between peers, enhancing the reliability of distributed machine learning tasks. Even under Byzantine attacks, the system's robust aggregation algorithms maintain high accuracy. These findings illuminate the promising potential of serverless architectures in P2P distributed machine learning, offering a significant stride toward more efficient, scalable, and resilient applications.
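
The headline speedup above comes from performing gradient averaging inside the datastore rather than shipping tensors back to each serverless function. Below is a minimal sketch of that idea, assuming the `redisai-py` client and a two-peer toy setup; the key names and the `average` TorchScript function are illustrative stand-ins, not SPIRT's actual implementation.

```python
# Minimal sketch: in-database gradient averaging with RedisAI.
# Assumes a RedisAI-enabled Redis instance on localhost:6379 and the
# redisai-py client; key names ("grad:peer1", etc.) are hypothetical.
import numpy as np
import redisai as rai

con = rai.Client(host="localhost", port=6379)

# Each peer publishes its gradient for the current step as a tensor.
con.tensorset("grad:peer1", np.random.rand(4).astype(np.float32))
con.tensorset("grad:peer2", np.random.rand(4).astype(np.float32))

# Register a TorchScript function once; it executes inside Redis,
# so the gradients never leave the database for averaging.
con.scriptset("avg", "cpu", """
def average(a, b):
    return (a + b) / 2
""")

# Run the in-database averaging and read back only the result.
con.scriptrun("avg", "average",
              inputs=["grad:peer1", "grad:peer2"],
              outputs=["grad:avg"])
print(con.tensorget("grad:avg"))
```

The design point is that only the small averaged tensor crosses the network back to a peer's serverless function, which is what makes the in-database path cheaper than client-side aggregation as the model and peer count grow.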