IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency (2308.12871v3)
Abstract: Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to optimize for only one of these objectives. The challenge, however, lies in reconciling all three trade-offs at once. To address this challenge and efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of a pre-trained model for the same deep learning task that differ in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings, achieving different trade-offs between accuracy and cost while remaining adaptable to varying workloads and dynamic traffic patterns. By navigating a wider variety of configurations, IPA achieves better trade-offs between cost and accuracy than existing methods. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase. The code and data for replication are available at https://github.com/reconfigurable-ml-pipeline/ipa.
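To make the trade-off concrete, the following is a minimal, self-contained sketch of the kind of configuration search the abstract describes: for each task, pick one model variant, a batch size, and a replica count so that end-to-end latency stays under the SLA while an accuracy-minus-cost objective is maximized. All model names, numbers, and the toy latency model below are invented for illustration; IPA itself formulates this as an Integer Program over whole pipelines rather than solving it by brute force.

```python
# Hypothetical sketch of the accuracy/cost/latency trade-off that IPA's
# Integer Program navigates. Every number here is illustrative only.
from itertools import product

# (name, accuracy, cost per replica, base latency in ms) -- made-up values
VARIANTS = [
    ("resnet18", 0.70, 1.0, 30.0),
    ("resnet50", 0.76, 2.0, 60.0),
    ("resnet152", 0.78, 4.0, 110.0),
]
BATCH_SIZES = [1, 2, 4]
REPLICAS = [1, 2, 3]

def latency(base_ms, batch, arrival_rate):
    # Toy model: larger batches amortize compute but add batching delay
    # proportional to how long it takes to fill the batch.
    return base_ms * (1 + 0.2 * batch) + batch / arrival_rate * 1000

def best_config(sla_ms, arrival_rate, cost_weight=0.02):
    """Exhaustively pick (variant, batch, replicas) meeting the SLA."""
    best, best_obj = None, float("-inf")
    for (name, acc, cost, base), batch, reps in product(
            VARIANTS, BATCH_SIZES, REPLICAS):
        # Each replica serves an equal share of the arriving load.
        lat = latency(base, batch, arrival_rate / reps)
        if lat > sla_ms:
            continue  # configuration violates the latency SLA
        obj = acc - cost_weight * cost * reps  # multi-objective trade-off
        if obj > best_obj:
            best, best_obj = (name, batch, reps), obj
    return best

print(best_config(sla_ms=200, arrival_rate=10))
# → ('resnet50', 1, 1)
```

With a loose 200 ms SLA the search can afford the more accurate resnet50; tightening the SLA to 150 ms forces it back to the cheaper, faster resnet18, which is exactly the adaptation behavior IPA automates under dynamic traffic.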