SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
Abstract: The increasing deployment of ML models on the critical path of production applications, both in the datacenter and at the edge, requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the efficient utilization of scarce resources. State-of-the-art systems resolve this tension either by choosing a static point in the latency-accuracy tradeoff space to serve all requests or by loading specific models on the critical path of request serving. In this work, we instead resolve this tension by simultaneously serving the entire range of models spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct, achieves this by carefully inserting specialized operators in weight-shared SuperNetworks. These operators enable SubNetAct to dynamically route requests through the network to meet a latency and accuracy target. SubNetAct requires up to 2.6x lower memory to serve a vastly higher number of models than the prior state-of-the-art. In addition, SubNetAct's near-instantaneous actuation of models unlocks the design space of fine-grained, reactive scheduling policies. We explore the design of one such extremely effective policy, SlackFit, and instantiate both SubNetAct and SlackFit in a real system, SuperServe. SuperServe achieves 4.67% higher accuracy for the same SLO attainment and 2.85x higher SLO attainment for the same accuracy on a trace derived from the real-world Microsoft Azure Functions workload, and automatically yields the best trade-offs on a wide range of extremely bursty synthetic traces.
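The core idea above can be sketched conceptually: a weight-shared supernet exposes many subnets at different points in the latency-accuracy tradeoff space, and a SlackFit-style policy routes each request to the most accurate subnet whose profiled latency still fits the request's remaining SLO slack. The sketch below is a minimal illustration of that selection logic, not the paper's implementation; all subnet names, latencies, and accuracies are hypothetical.

```python
# Illustrative sketch of SlackFit-style subnet selection over a
# weight-shared supernet. All profile numbers are made up.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Subnet:
    name: str
    latency_ms: float   # profiled inference latency of this subnet
    accuracy: float     # profiled accuracy of this subnet

# A tiny (hypothetical) profile of subnets spanning the
# latency-accuracy tradeoff space of one supernet.
PROFILE = [
    Subnet("d2_w0.50", latency_ms=4.0,  accuracy=0.71),
    Subnet("d3_w0.75", latency_ms=7.5,  accuracy=0.75),
    Subnet("d4_w1.00", latency_ms=12.0, accuracy=0.78),
]

def slack_fit(slack_ms: float) -> Optional[Subnet]:
    """Pick the highest-accuracy subnet whose latency fits the slack."""
    feasible = [s for s in PROFILE if s.latency_ms <= slack_ms]
    return max(feasible, key=lambda s: s.accuracy) if feasible else None

# A request with 8 ms of remaining SLO slack is routed to the most
# accurate subnet that still meets its deadline.
print(slack_fit(8.0).name)   # -> d3_w0.75
```

Because the subnets share weights, switching between them is a routing decision rather than a model load, which is what makes this kind of fine-grained, per-request policy practical.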