On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance

Published 25 Mar 2024 in cs.SE, cs.AI, and cs.LG (arXiv:2403.17154v3)

Abstract: Deciding which combination of operators to use across the Edge AI tiers to meet specific latency and model performance requirements is an open question for MLOps engineers. This study empirically assesses the accuracy vs. inference time trade-off of different black-box Edge AI deployment strategies, i.e., combinations of deployment operators and deployment tiers. We conduct inference experiments involving 3 deployment operators (Partitioning, Quantization, Early Exit), 3 deployment tiers (Mobile, Edge, Cloud), and their combinations on four widely used Computer-Vision models to identify the optimal strategies from the point of view of MLOps developers. Our findings suggest that Edge deployment using the hybrid Quantization + Early Exit operator is preferable to the non-hybrid operators (Quantization/Early Exit on Edge, Partitioning on Mobile-Edge) when lower latency is the priority and a moderate accuracy loss is acceptable. However, when minimizing accuracy loss is the priority, MLOps engineers should prefer the plain Quantization operator on Edge, which reduces latency relative to the Early Exit/Partitioning operators (on Edge/Mobile-Edge) but increases it relative to the Quantized Early Exit operator (on Edge). In scenarios constrained by Mobile CPU/RAM resources, Partitioning across the Mobile and Edge tiers is preferable to Mobile-only deployment. For models with smaller input samples (such as FCN), a network-constrained Cloud deployment can also be a better alternative to Mobile/Edge deployment and Partitioning strategies. For models with larger input samples (ResNet, ResNeXt, DUC), an Edge tier with higher network/computational capabilities than Cloud/Mobile can be a more viable option than Partitioning and Mobile/Cloud deployment strategies.
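
To make the three operators concrete, below is a minimal, hypothetical PyTorch sketch, not the paper's artifact or experimental setup, of how they can compose into the hybrid strategy the abstract discusses: a toy two-stage CNN whose mobile-side head carries a BranchyNet-style early-exit branch (Early Exit), whose edge-side tail receives the offloaded intermediate tensor (Partitioning), and whose tail is post-training dynamically quantized to int8 (Quantization). All layer sizes, the class count, and the 0.9 confidence threshold are illustrative assumptions.

```python
# Illustrative sketch only: toy models and thresholds are assumptions,
# not the paper's actual ResNet/ResNeXt/FCN/DUC configurations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadWithEarlyExit(nn.Module):
    """Runs on the mobile tier; may answer early and skip the edge hop."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Early Exit operator: a small side classifier on early features.
        self.exit_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):
        feats = self.features(x)
        return feats, self.exit_head(feats)

class Tail(nn.Module):
    """Runs on the edge tier; consumes the partitioned intermediate tensor."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, feats):
        return self.classifier(self.features(feats))

head, tail = HeadWithEarlyExit().eval(), Tail().eval()

# Quantization operator: post-training dynamic quantization of the
# tail's Linear layers (int8 weights, dequantized on the fly).
tail_q = torch.ao.quantization.quantize_dynamic(
    tail, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 3, 32, 32)  # stand-in for one input sample
with torch.no_grad():
    feats, early_logits = head(x)              # mobile tier
    conf = F.softmax(early_logits, dim=1).max()
    if conf >= 0.9:                            # Early Exit: answer locally
        pred = early_logits.argmax(dim=1)
    else:                                      # Partitioning: offload tensor
        pred = tail_q(feats).argmax(dim=1)     # quantized edge-side tail
print(pred.item())
```

In this sketch, the confidence check captures the trade-off the abstract describes: a confident early answer skips the network hop to the edge tier entirely (lower latency, possible accuracy loss), while uncertain inputs pay the transfer cost to run the quantized remainder of the model.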
