Exploring the Impact of Serverless Computing on Peer To Peer Training Machine Learning

Published 25 Sep 2023 in cs.DC and cs.AI (arXiv:2309.14139v1)

Abstract: The increasing demand for computational power in big data and machine learning has driven the development of distributed training methodologies. Among these, peer-to-peer (P2P) networks provide advantages such as enhanced scalability and fault tolerance. However, they also encounter challenges related to resource consumption, costs, and communication overhead as the number of participating peers grows. In this paper, we introduce a novel architecture that combines serverless computing with P2P networks for distributed training and present a method for efficient parallel gradient computation under resource constraints. Our findings show a significant enhancement in gradient computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods. As for costs, our examination confirmed that the serverless architecture could incur higher expenses, reaching up to 5.4 times more than instance-based architectures. It is essential to consider that these higher costs are associated with marked improvements in computation time, particularly under resource-constrained scenarios. Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model. Utilizing dynamic resource allocation, it enables faster training times and optimized resource utilization, making it a promising candidate for a wide range of machine learning applications.
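The paper's implementation is not reproduced on this page, but the core idea of the architecture described in the abstract can be sketched: instead of computing its full gradient locally, each peer fans the work out to serverless functions running in parallel and aggregates the partial results. Below is a minimal Python sketch of that fan-out step, assuming a hypothetical AWS Lambda function named grad-worker that accepts the current model weights and a shard index and returns the gradient for that shard; the function name, payload format, and sharding scheme are illustrative assumptions, not the paper's actual interface.

```python
# Sketch of serverless gradient fan-out for one peer (hypothetical interface).
# Assumes a deployed AWS Lambda function "grad-worker" that accepts
# {"weights": [...], "shard_id": int} and returns {"gradient": [...]}.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def invoke_worker(weights: list[float], shard_id: int) -> list[float]:
    """Synchronously invoke one serverless worker on a single data shard."""
    resp = lambda_client.invoke(
        FunctionName="grad-worker",        # hypothetical function name
        InvocationType="RequestResponse",  # wait for the result
        Payload=json.dumps({"weights": weights, "shard_id": shard_id}),
    )
    return json.loads(resp["Payload"].read())["gradient"]

def parallel_gradient(weights: list[float], num_shards: int) -> list[float]:
    """Fan out one invocation per shard, then average the partial gradients."""
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        grads = list(pool.map(lambda i: invoke_worker(weights, i),
                              range(num_shards)))
    # Element-wise average over equal-sized shards recovers the batch gradient.
    return [sum(component) / num_shards for component in zip(*grads)]
```

This sketch also makes the pay-as-you-go argument concrete: the peer pays only for the time each invocation actually runs, so idle peers incur no compute cost. Note as well that a 97.34% reduction in gradient computation time corresponds to roughly a 37x speedup; even at up to 5.4 times the cost, the serverless path can be attractive when wall-clock time dominates, though these two "up to" figures are the paper's extremes and need not occur on the same workload.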
