Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU (2307.04339v1)
Abstract: Many applications, such as autonomous driving and augmented reality, require the concurrent execution of multiple deep neural networks (DNNs) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-grade GPUs, edge GPUs are resource-constrained and lack hardware-level resource-management mechanisms for avoiding resource contention. We therefore propose Miriam, a contention-aware task coordination framework for multi-DNN inference on edge GPUs. Miriam consolidates two main components, an elastic-kernel generator and a runtime dynamic kernel coordinator, to support mixed-criticality DNN inference. To evaluate Miriam, we build a new CUDA-based DNN inference benchmark with diverse, representative DNN workloads. Experiments on two edge GPU platforms show that Miriam increases system throughput by 92% while incurring less than 10% latency overhead for critical tasks, compared to state-of-the-art baselines.
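To make the coordination idea concrete, here is a minimal, purely illustrative sketch of a "dynamic kernel coordinator" in the spirit the abstract describes: critical (real-time) work is always dispatched, while best-effort kernel slices are admitted only when they fit within the critical task's latency budget. All names, priorities, and durations below are invented for illustration; Miriam itself operates on actual CUDA kernels, not Python objects, and its real admission policy is more sophisticated than this toy budget check.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class KernelSlice:
    """A hypothetical unit of GPU work. priority 0 = critical, 1 = best-effort."""
    priority: int
    duration_ms: float = field(compare=False)
    name: str = field(compare=False)

def coordinate(slices, critical_budget_ms):
    """Toy coordinator: dispatch critical slices first, and admit a
    best-effort slice only if total elapsed time stays within the
    critical-latency budget; otherwise the slice is skipped."""
    heap = list(slices)
    heapq.heapify(heap)  # orders by priority: critical slices pop first
    schedule, elapsed = [], 0.0
    while heap:
        s = heapq.heappop(heap)
        if s.priority > 0 and elapsed + s.duration_ms > critical_budget_ms:
            continue  # best-effort work that would break the budget is dropped
        schedule.append(s.name)
        elapsed += s.duration_ms
    return schedule
```

For example, with an 8 ms budget, a 3 ms critical detection slice plus a 4 ms best-effort slice both run (7 ms total), whereas a 6 ms best-effort slice would be rejected because 3 + 6 ms exceeds the budget.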