Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (2006.02464v2)

Published 3 Jun 2020 in cs.DC and cs.LG

Abstract: Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable - on the contrary we observe that inference using Deep Neural Network (DNN) models has deterministic performance. Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. We evaluate our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100ms latency targets for 99.9999% of requests. We further demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.

Citations (227)

Summary

  • The paper introduces Clockwork, a system architecture that achieves predictable DNN inference through deterministic GPU operations and centralized, action-based scheduling.
  • It demonstrates that the system meets stringent latency targets, maintaining 100ms SLOs for 99.9999% of requests while scaling to thousands of concurrent models.
  • This work establishes a new paradigm in ML infrastructure, enabling reliable and efficient real-time DNN serving in interactive applications.

Essay on "Serving DNNs like Clockwork: Performance Predictability from the Bottom Up"

The paper "Serving DNNs like Clockwork: Performance Predictability from the Bottom Up" addresses the significant issue of latency predictability in ML inference, particularly for Deep Neural Networks (DNNs), which have become pivotal components in interactive web applications. The challenge here lies in consistently meeting stringent low-latency requirements, especially concerning tail latency, which conventional reactive methods have struggled to address. The authors propose a novel system architecture named "Clockwork," designed to achieve predictability by fundamentally restructuring how DNN models are served.

Core Methodology and System Design

The fundamental observation driving this research is that DNN execution, being a fixed sequence of operations, is highly predictable when run in isolation. This observation opens the door to a system that preserves predictability end to end rather than reacting to latency variance after the fact. Clockwork exploits this property by restructuring the model serving stack so that performance variability is limited through consolidated decision-making at a centralized controller.

  • Predictability from Determinism: DNN inference executes a fixed sequence of operations with no data-dependent branches, so its execution path, and therefore its latency, is essentially deterministic. This holds especially at the GPU level, where individual kernels run for highly repeatable durations (see the measurement sketch after this list).
  • Clockwork Architecture: Clockwork is a distributed system in which a centralized controller orchestrates DNN inference across multiple workers. Each worker exclusively manages its GPUs' resources, performing model loads, cache evictions, and inference executions only when instructed, so that no hidden concurrency undermines predictability. By consolidating scheduling and execution decisions in the controller, Clockwork eliminates internal sources of unpredictability.
  • Action-Based Scheduling: The controller issues workers explicit actions (for example, load a model or run an inference), each annotated with a precise execution window; a worker runs an action only if it can start within that window and rejects it otherwise. This bounds queueing delays, keeps resource use under the controller's control, and lets deviations in execution timing be detected and handled proactively (a sketch of this abstraction follows the list).
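
To make the determinism claim concrete, the following is a minimal micro-benchmark sketch, not the paper's measurement harness, that times repeated inferences of a fixed-shape input; it assumes PyTorch, torchvision, and a CUDA GPU. In steady state the measured latencies typically show a very small standard deviation, which is the property Clockwork's latency predictions rely on.

```python
import time
import statistics

import torch
import torchvision.models as models

# Hypothetical micro-benchmark (not from the paper): time repeated inferences
# of a fixed-shape input. With no data-dependent branches, the GPU executes
# the same kernel sequence every iteration, so latencies barely vary.
model = models.resnet50(weights=None).eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

latencies_ms = []
with torch.no_grad():
    for _ in range(110):
        torch.cuda.synchronize()          # exclude previously queued async work
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize()          # wait for the inference to finish
        latencies_ms.append((time.perf_counter() - start) * 1e3)

steady = latencies_ms[10:]                # drop warm-up iterations
print(f"mean={statistics.mean(steady):.3f} ms  "
      f"stdev={statistics.stdev(steady):.3f} ms")
```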
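The action abstraction itself can be sketched in a few lines. The field and function names below are hypothetical (the paper describes LOAD and INFER actions carrying explicit timing bounds, but not this exact interface); the point is that a worker never lets an action queue indefinitely: it either starts within the controller-specified window or rejects it.

```python
from dataclasses import dataclass
from enum import Enum, auto
import time


class ActionType(Enum):
    LOAD = auto()    # copy model weights from host RAM into GPU memory
    INFER = auto()   # execute one batch of inference on a loaded model


@dataclass
class Action:
    # Hypothetical fields; the paper's actions carry an explicit
    # [earliest, latest] window assigned by the central controller.
    kind: ActionType
    model_id: int
    earliest: float  # do not start before this time (seconds, monotonic)
    latest: float    # reject if the action cannot start by this time


def run_action(action: Action, execute) -> bool:
    """Worker-side handling: run inside the window, otherwise reject.

    Rejections are reported back to the controller, which re-plans with
    up-to-date knowledge instead of letting hidden queues build up.
    """
    now = time.monotonic()
    if now < action.earliest:
        time.sleep(action.earliest - now)   # wait for the window to open
    if time.monotonic() > action.latest:
        return False                        # window missed: reject
    execute(action)                         # runs for a predictable duration
    return True
```

Because each action's duration is itself predictable, the controller can chain actions back to back and know in advance when each GPU will next be free.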

Numerical Evaluations and Results

The implementation and evaluation of Clockwork demonstrate that it meets rigorous Service Level Objectives (SLOs), maintaining 100ms latency targets for 99.9999% of requests. The system scales to thousands of concurrent models, substantially curtailing tail latency. Benchmarked against existing model serving frameworks such as Clipper and INFaaS, Clockwork exhibits superior latency behavior and resource utilization.

  • Quantitative Findings: Clockwork outperforms Clipper and INFaaS at meeting latency targets and sustains high goodput even under unpredictable, bursty workloads with many contending clients.
  • Scalability and Efficiency: The architecture scales without sacrificing predictability, handling heterogeneous real-world workloads and suiting applications that update or test models frequently.

Implications and Future Prospects

The implications of Clockwork extend to both academic research and practical deployment. The insight that DNN inference is inherently predictable reshapes how ML serving infrastructure can be designed, motivating further exploration of deterministic execution models. On a practical level, Clockwork paves the way for serving ML in latency-critical interactive applications, improving reliability and efficiency without incurring excess resource costs.

Future research could explore enhancing the predictability of inference processes with varying workloads, advancing the paradigm of deterministic execution models, and adapting the architectural principles to emerging AI technologies, including those involving highly dynamic and non-deterministic ML models.

In conclusion, this work underscores the potential of designing ML systems from the ground up with predictability as a first-class goal, heralding a new era of dependable AI services in interactive systems.