Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips (2308.05869v2)
Abstract: Two distinguishing features of state-of-the-art mobile and autonomous systems are that 1) multiple workloads, mainly deep neural network (DNN) inference, often run concurrently and continuously, and 2) they operate on shared-memory systems-on-chip (SoCs) that embed heterogeneous accelerators tailored for specific operations. The state of the art lacks the efficient performance- and resource-management techniques necessary to either maximize total system throughput or minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a novel scheme that characterizes and maps layers of concurrently executing DNN inference workloads onto the diverse set of accelerators within an SoC. Our scheme uniquely takes per-layer execution characteristics, shared memory (SM) contention, and inter-accelerator transitions into account to find optimal schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN reduces memory contention by up to 45% and improves latency and total throughput by up to 32% and 29%, respectively, compared to state-of-the-art approaches.
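The abstract describes HaX-CoNN as assigning the layers of concurrently running DNNs to an SoC's accelerators while accounting for per-layer execution time, shared-memory contention from the co-scheduled workload, and the cost of transitioning between accelerators. The sketch below is a minimal illustration of that kind of cost model under stated assumptions, not the paper's actual method: the accelerator names, per-layer latencies, contention slowdown factors, and transition cost are all hypothetical placeholders, and it uses naive exhaustive search rather than whatever optimization procedure HaX-CoNN employs.

```python
# Illustrative sketch only: a brute-force, contention-aware layer-to-accelerator
# assignment. All names and numbers below (ACCELERATORS, latencies, slowdowns,
# transition cost) are hypothetical, not taken from the paper.
from itertools import product

ACCELERATORS = ("GPU", "DLA")  # assumed pair of accelerators on one SoC

# Assumed standalone latency (ms) of each layer on each accelerator.
layer_latency = [
    {"GPU": 1.0, "DLA": 1.8},
    {"GPU": 2.5, "DLA": 2.2},
    {"GPU": 0.8, "DLA": 0.9},
]

# Assumed slowdown when the co-running DNN is stressing shared memory.
contention_slowdown = {"GPU": 1.30, "DLA": 1.15}

# Assumed cost (ms) of handing intermediate tensors to a different accelerator.
transition_cost = 0.4

def schedule_latency(assignment, contended=True):
    """Estimate end-to-end latency of one DNN for a layer-to-accelerator
    assignment, adding contention slowdown and inter-accelerator transitions."""
    total = 0.0
    for i, acc in enumerate(assignment):
        latency = layer_latency[i][acc]
        if contended:
            latency *= contention_slowdown[acc]
        total += latency
        if i > 0 and assignment[i - 1] != acc:  # accelerator switch between layers
            total += transition_cost
    return total

def best_schedule():
    """Enumerate every assignment (feasible only for tiny models)."""
    candidates = product(ACCELERATORS, repeat=len(layer_latency))
    return min(candidates, key=schedule_latency)

if __name__ == "__main__":
    best = best_schedule()
    print("assignment:", best, "estimated latency (ms):", round(schedule_latency(best), 2))
```

Exhaustive enumeration grows exponentially with the number of layers, which is why schedulers in this space typically group layers and rely on solver-based or heuristic search instead of brute force; the sketch is only meant to make the three cost components named in the abstract concrete.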