MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms (2404.02445v1)
Abstract: With its elastic scaling and pay-as-you-go cost model, serverless computing is making the deployment of deep learning inference services (DLISs) an increasingly prevalent trend. However, when a DLIS is deployed as a single function on a serverless platform, the varying resource requirements of the different layers in a DL model hinder resource utilization and increase cost. To tackle this problem, we propose a model partitioning framework called MOPAR. This work is based on two resource usage patterns of DLISs: global differences and local similarity, which arise from the presence of resource-dominant (RD) operators and layer stacking. Considering these patterns, MOPAR adopts a hybrid approach that first divides the DL model vertically into multiple slices composed of similar layers to improve resource efficiency. Slices containing RD operators are further partitioned into multiple sub-slices, enabling parallel optimization to reduce inference latency. Moreover, MOPAR employs data compression and shared-memory techniques to offset the additional time introduced by communication between slices. We implement a prototype of MOPAR and evaluate its efficacy using 12 DL models from four categories on OpenFaaS and AWS Lambda. The experimental results show that MOPAR improves the resource efficiency of DLISs by 27.62% on average while reducing latency by about 5.52%. Furthermore, based on Lambda's pricing, MOPAR reduces the cost of running DLISs by a factor of about 2.58.
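To make the slicing idea concrete, the following is a minimal Python sketch of vertical partitioning as described in the abstract: consecutive layers with similar resource profiles are greedily grouped into a slice, and any slice containing a resource-dominant (RD) operator is flagged for further parallel splitting. This is an illustrative toy, not the MOPAR implementation; the profile fields (`memory_mb`, `compute_ms`), the similarity tolerance, and the RD thresholds are assumptions made for the example.

```python
# Illustrative sketch (not the authors' implementation) of grouping
# consecutive DL layers into slices by resource similarity, and flagging
# slices that contain a resource-dominant (RD) operator.
from dataclasses import dataclass, field
from typing import List


@dataclass
class LayerProfile:
    name: str
    memory_mb: float    # assumed: peak memory observed when profiling the layer
    compute_ms: float   # assumed: measured compute time of the layer


@dataclass
class Slice:
    layers: List[LayerProfile] = field(default_factory=list)
    has_rd_operator: bool = False


def is_resource_dominant(layer: LayerProfile,
                         mem_threshold: float,
                         time_threshold: float) -> bool:
    """Hypothetical RD test: a layer dominates if it exceeds both thresholds."""
    return layer.memory_mb > mem_threshold and layer.compute_ms > time_threshold


def partition_vertically(layers: List[LayerProfile],
                         similarity_tol: float = 0.3,
                         mem_threshold: float = 200.0,
                         time_threshold: float = 20.0) -> List[Slice]:
    """Greedily merge consecutive layers whose memory usage stays within
    `similarity_tol` (relative difference) of the current slice's average."""
    slices: List[Slice] = []
    current = Slice()
    for layer in layers:
        if current.layers:
            avg_mem = sum(l.memory_mb for l in current.layers) / len(current.layers)
            if abs(layer.memory_mb - avg_mem) / max(avg_mem, 1e-6) > similarity_tol:
                # Resource profile changes noticeably: close the current slice.
                slices.append(current)
                current = Slice()
        current.layers.append(layer)
        if is_resource_dominant(layer, mem_threshold, time_threshold):
            current.has_rd_operator = True
    if current.layers:
        slices.append(current)
    return slices


if __name__ == "__main__":
    profiles = [
        LayerProfile("conv1", 50, 5), LayerProfile("conv2", 55, 6),
        LayerProfile("fc1", 400, 40),   # a heavy layer, treated as RD here
        LayerProfile("fc2", 60, 4),
    ]
    for i, s in enumerate(partition_vertically(profiles)):
        print(f"slice {i}: {[l.name for l in s.layers]}, RD={s.has_rd_operator}")
```

In this sketch, an RD slice would subsequently be split into sub-slices and deployed as separate functions that run in parallel; the compression and shared-memory mechanisms mentioned in the abstract are not modeled here.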