Graft: Efficient Inference Serving for Hybrid Deep Learning with SLO Guarantees via DNN Re-alignment (2312.10636v1)
Abstract: Deep neural networks (DNNs) have been widely adopted for various mobile inference tasks, yet their ever-increasing computational demands hinder their deployment on resource-constrained mobile devices. Hybrid deep learning partitions a DNN into two parts and deploys them across the mobile device and a server, aiming to reduce inference latency or prolong the battery life of mobile devices. However, such partitioning produces non-uniform DNN fragments that are hard to serve efficiently on the server. This paper presents Graft -- an efficient inference serving system for hybrid deep learning with latency service-level objective (SLO) guarantees. Our main insight is to mitigate this non-uniformity through a core concept called DNN re-alignment, which restructures multiple heterogeneous DNN fragments so that they share layers. To fully exploit the potential of DNN re-alignment, Graft employs fine-grained GPU resource sharing. On top of that, we propose efficient algorithms for merging, grouping, and re-aligning DNN fragments that maximize request batching opportunities and minimize resource consumption while guaranteeing the inference latency SLO. We implement a Graft prototype and perform extensive experiments with five types of widely used DNNs and real-world network traces. Our results show that Graft improves resource efficiency by up to 70% compared with state-of-the-art inference serving systems.
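To make the partitioning and re-alignment ideas concrete, here is a minimal PyTorch sketch. It is an illustration of the general concept, not Graft's actual algorithm or code: the sequential VGG-16 backbone, the split points, and the helper names `split_model` and `realign_fragments` are all assumptions made for this example.

```python
# Illustrative sketch of hybrid partitioning and fragment re-alignment
# (NOT Graft's implementation). Assumes a purely sequential backbone.
import torch
import torch.nn as nn
import torchvision.models as models

def split_model(model: nn.Sequential, split: int):
    """Hybrid partitioning: layers [0, split) run on the mobile device,
    layers [split, end) form the server-side fragment."""
    layers = list(model.children())
    return nn.Sequential(*layers[:split]), nn.Sequential(*layers[split:])

def realign_fragments(model: nn.Sequential, splits):
    """Re-align heterogeneous server-side fragments: the suffix starting at
    the latest split point is shared by all clients (so their requests can be
    batched); each earlier-split client gets a small private 'stem' that
    bridges its own split point to the shared one."""
    layers = list(model.children())
    shared_start = max(splits)
    stems = {s: nn.Sequential(*layers[s:shared_start]) for s in splits}
    shared = nn.Sequential(*layers[shared_start:])
    return stems, shared

backbone = nn.Sequential(*models.vgg16(weights=None).features)

# Two clients picked different split points (e.g., due to different network
# conditions), which would normally yield non-uniform server-side fragments.
splits = [7, 12]
stems, shared = realign_fragments(backbone, splits)

# Each client runs its own device-side part and uploads the activation.
act_a = split_model(backbone, 7)[0](torch.randn(1, 3, 224, 224))
act_b = split_model(backbone, 12)[0](torch.randn(1, 3, 224, 224))

# Server: bring both activations to the shared split point, then batch them
# through one shared fragment instead of serving two separate fragments.
batch = torch.cat([stems[7](act_a), stems[12](act_b)], dim=0)
out = shared(batch)
print(out.shape)  # torch.Size([2, 512, 7, 7])
```

The point of the sketch is that after re-alignment both requests traverse a single shared fragment, so they can be batched and the server keeps one copy of those layers rather than one per split point.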
Authors: Jing Wu, Lin Wang, Qirui Jin, Fangming Liu