EdgeSight: Enabling Modeless and Cost-Efficient Inference at the Edge (2405.19213v2)
Abstract: Traditional ML inference is evolving toward modeless inference, which abstracts the complexity of model selection away from users, allowing the system to automatically choose the most appropriate model for each request based on accuracy and resource requirements. While prior studies have focused on modeless inference within data centers, this paper tackles the pressing need for cost-efficient modeless inference at the edge -- particularly under its unique constraints of limited device memory, volatile network conditions, and restricted power consumption. To overcome these challenges, we propose EdgeSight, a system that provides cost-efficient modeless serving for diverse DNNs at the edge. EdgeSight employs an edge-data center (edge-DC) architecture, utilizing confidence scaling to reduce the number of model options while meeting diverse accuracy requirements. Additionally, it supports lossy inference in volatile network environments. Our experimental results show that EdgeSight outperforms existing systems by up to 1.6x in P99 latency for modeless services. Furthermore, our FPGA prototype demonstrates similar performance at certain accuracy levels, with a power consumption reduction of up to 3.34x.
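To make the confidence-scaling idea concrete, the sketch below shows one common way such an edge-DC cascade can be wired up: a small edge model answers a request whenever its softmax confidence clears a threshold, and the request is escalated to a larger data-center model otherwise. The `route_request` function, the `threshold=0.9` value, and the string labels are illustrative assumptions for this sketch, not EdgeSight's actual interface.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def route_request(edge_logits, threshold=0.9):
    """Return (prediction, where_served).

    If the edge model's top-class confidence clears `threshold`,
    serve the answer locally; otherwise signal escalation to a
    larger data-center model (prediction left as None here).
    """
    probs = softmax(np.asarray(edge_logits, dtype=float))
    confidence = float(probs.max())
    if confidence >= threshold:
        return int(probs.argmax()), "edge"
    return None, "escalate-to-dc"
```

In this framing, "confidence scaling" amounts to calibrating the edge model's scores (e.g. via temperature scaling) so that a single threshold per accuracy target replaces a large menu of candidate models.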