HarmonyBatch: Serverless DNN Inference
- HarmonyBatch is a serverless-native framework for cost-efficient DNN inference using joint batching and dynamic resource provisioning to meet diverse SLOs.
- It employs analytical performance and cost models to predict CPU/GPU latency and optimize resource allocation across multiple applications.
- The system features a two-stage merging heuristic that ensures fast provisioning and significant cost reduction while maintaining strict SLO compliance.
HarmonyBatch is a serverless-native resource provisioning and batching framework designed to provide predictable, cost-efficient Deep Neural Network (DNN) inference for multi-application serverless platforms with heterogeneous compute functions (CPU and GPU). It addresses multi-SLO (Service Level Objective) inference scenarios, enabling joint batching of requests with distinct latency deadlines across diverse applications, and dynamically provisions resources to minimize operational cost while ensuring SLO compliance. HarmonyBatch is prototyped and evaluated on Alibaba Cloud Function Compute, featuring analytical performance and cost modeling and a heuristic algorithm for the NP-hard joint batching and function-selection problem (Chen et al., 2024).
1. Architecture and System Components
HarmonyBatch comprises four principal components responsible for profiling, batching, resource selection, and automated grouping:
- Model Profiler & Performance Predictor: Onboarding a new DNN model triggers the profiler to conduct micro-benchmarks across various batch sizes and CPU/GPU resource slices, extracting coefficients to parameterize empirical latency models. The predictor then estimates average and tail latencies for subsequent batching and resource allocation.
- Batch Manager: Each application maintains a FIFO queue, tagging each incoming inference request with its SLO deadline. Batches are formed either on expiration of the earliest deadline or when reaching target batch size and are then forwarded for execution.
- Function Launcher (Resource Provisioner): This component utilizes the analytical models to select the optimal serverless function (CPU or GPU, with the required vCPUs or memory) and corresponding batch size to meet the tightest SLO in each batch while minimizing provisioning cost. The Alibaba FC-Open SDK is used for dynamic function invocation and scaling.
- Two-Stage Group Merger: Periodically, or under workload dynamics, HarmonyBatch re-clusters applications sharing the same model into groups to optimize batching and provisioning plans in response to observed arrival rates and SLO distributions.
This decoupled architecture supports adaptable grouping and dispatching policies, enabling online adaptation to time-varying workloads typical in production serverless environments.
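As a concrete illustration of the Batch Manager's dispatch rule, the following sketch (names and structure are illustrative, not the HarmonyBatch source) tags each request with its SLO deadline and dispatches a batch either when the target size is reached or when waiting any longer would miss the earliest deadline:

```python
from collections import deque

class BatchManager:
    """Illustrative per-application batch queue: requests are tagged with
    their SLO deadline; a batch is dispatched when the target batch size
    is reached or when the earliest deadline would otherwise be missed."""

    def __init__(self, slo_s, target_batch, exec_latency_s):
        self.slo_s = slo_s                    # per-request latency SLO (s)
        self.target_batch = target_batch      # batch size from the provisioner
        self.exec_latency_s = exec_latency_s  # predicted batch execution latency (s)
        self.queue = deque()                  # FIFO of (request, deadline)

    def enqueue(self, request, now):
        """Add a request at time `now`; return a batch if one is dispatched."""
        self.queue.append((request, now + self.slo_s))
        return self.maybe_dispatch(now)

    def maybe_dispatch(self, now):
        if not self.queue:
            return None
        full = len(self.queue) >= self.target_batch
        # Dispatch early enough that batch execution still meets the
        # earliest deadline in the queue.
        expiring = now + self.exec_latency_s >= self.queue[0][1]
        if full or expiring:
            n = min(self.target_batch, len(self.queue))
            return [self.queue.popleft()[0] for _ in range(n)]
        return None
```

Using explicit timestamps (rather than wall-clock reads) keeps the dispatch logic deterministic and easy to test.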
2. Analytical Performance and Cost Models
HarmonyBatch introduces closed-form, empirically-grounded models for CPU and GPU inference latency and monetized operational cost:
CPU Latency: For a batch of size b on a function with c vCPUs, the average and maximum inference latencies are modeled as exponential decay functions of the vCPU allocation, ℓ(b, c) = α_b · e^(−β_b · c) + γ_b,
where α_b, β_b, γ_b are batch-specific coefficients fitted during profiling.
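A minimal sketch of evaluating such an exponential-decay latency model; the coefficient values below are hypothetical stand-ins for profiler-fitted ones:

```python
import math

def cpu_latency(coeffs, c):
    """Exponential-decay CPU latency model l(c) = alpha * exp(-beta * c) + gamma
    for one batch size; (alpha, beta, gamma) are batch-specific coefficients
    obtained from profiling (illustrative values below)."""
    alpha, beta, gamma = coeffs
    return alpha * math.exp(-beta * c) + gamma

# Hypothetical coefficients for one batch size, in seconds:
coeffs_b4 = (0.80, 0.90, 0.12)

# More vCPUs -> lower predicted latency, saturating at gamma.
lat_1 = cpu_latency(coeffs_b4, 1.0)
lat_4 = cpu_latency(coeffs_b4, 4.0)
```

The γ_b floor captures the diminishing returns of adding vCPUs, which is what makes a cost-optimal vCPU count well-defined.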
GPU Latency with Time-Slicing: Under full-device allocation, latency scales linearly with batch size, ℓ(b) = k · b + d.
- In Alibaba's cGPU mode with temporal sharing, a function allocated m GB of the device's total memory M runs only within its time slices, inflating latency by approximately the factor M/m plus a slice-scheduling overhead.
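The time-slicing effect can be sketched as follows, assuming a linear full-device latency model; the coefficients and the 24 GB device size are illustrative placeholders, not profiled values:

```python
def gpu_latency(b, mem_gb, total_gb=24.0, k=0.004, d=0.03):
    """Sketch of the GPU latency model: full-device latency is linear in
    batch size (l = k*b + d); under temporal sharing, a function holding
    mem_gb of a total_gb device only runs in its time slices, stretching
    latency by roughly total_gb / mem_gb (slice overhead omitted here)."""
    full_device = k * b + d
    return full_device * (total_gb / mem_gb)

lat_full = gpu_latency(8, mem_gb=24.0)   # whole device
lat_half = gpu_latency(8, mem_gb=12.0)   # half the device -> ~2x latency
```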
Poisson Arrivals and Timeout: Requests of each application i are assumed to arrive as an independent Poisson process with rate λ_i. Batching timeouts for a group are computed recursively from its members' SLOs and the predicted batch execution latency, so that even the first-arriving request in a batch meets its deadline.
- The expected batch size of a group then follows from its aggregate arrival rate Σ λ_i and its timeout, capped at the configured maximum batch size.
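Since the paper's recursive closed form is not reproduced here, a Monte-Carlo sketch can sanity-check the expected batch size under Poisson arrivals with a timeout and a size cap (parameters are illustrative):

```python
import random

def mean_batch_size(lam, timeout, max_batch, n_batches=20000, seed=0):
    """Monte-Carlo estimate of the expected batch size when Poisson(lam)
    arrivals are batched with a timeout started at the first request and
    a cap of max_batch. An illustrative check, not the paper's formula."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_batches):
        t, size = 0.0, 1          # the timeout clock starts at the first arrival
        while size < max_batch:
            t += rng.expovariate(lam)  # exponential inter-arrival time
            if t > timeout:
                break
            size += 1
        total += size
    return total / n_batches

# With lam = 10 req/s and a 0.4 s timeout, roughly 1 + lam * timeout = 5
# requests per batch are expected when the cap is not binding.
est = mean_batch_size(10.0, 0.4, max_batch=32)
```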
Cost Model: For a group with batch size b, function type f, resource allocation r (vCPUs or GPU memory), and average latency ℓ, the cost per request is C = (P_f · r · ℓ + P_0) / b,
- where P_cpu (CPU-time), P_gpu (GPU-memory), and P_0 (per-invocation overhead) are platform-specific pricing terms, with P_f denoting the price matching the chosen function type.
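A minimal sketch of this per-request cost; the prices below are placeholders, not Alibaba's published rates:

```python
def cost_per_request(price_resource, resource, avg_latency_s,
                     price_invocation, batch_size):
    """Per-request cost sketch: resource-time charge (vCPU-seconds or
    GPU-memory-seconds at a unit price) plus a fixed per-invocation fee,
    amortized over the batch. All prices here are illustrative."""
    return (price_resource * resource * avg_latency_s + price_invocation) / batch_size

# Larger batches amortize both compute time and invocation overhead,
# even though the batch itself runs somewhat longer:
c_b1 = cost_per_request(0.00009, 2.0, 0.30, 0.0000013, 1)
c_b8 = cost_per_request(0.00009, 2.0, 0.55, 0.0000013, 8)
```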
3. Grouping and Resource Optimization (NP-Hard Formulation)
HarmonyBatch considers N applications sharing a model, each with arrival rate λ_i and SLO S_i. The central optimization is to partition the apps into groups G_1, …, G_K, and for each group G_j select:
- Function type f_j ∈ {CPU, GPU}
- Resource allocation r_j (vCPUs or GPU memory)
- Batch size b_j
- Timeout T_j
such that the workload-wide expected cost, the arrival-rate-weighted sum of per-request costs Σ_j λ_j · C_j over all groups,
is minimized subject to:
- SLO constraint: T_j + ℓ_max(b_j, r_j) ≤ min_{i ∈ G_j} S_i for all groups j
- Batch feasibility: 1 ≤ b_j ≤ B, the maximum supported batch size
- GPU-memory limit: r_j ≤ M for GPU-backed groups
The resulting non-linear integer program is NP-hard.
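To illustrate the combinatorial blow-up: even restricting attention to groups that are contiguous in SLO order (as the merging heuristic does), there are 2^(N−1) candidate partitions, each requiring its own provisioning sub-problem. A small enumeration sketch (illustrative, not the paper's algorithm):

```python
from itertools import product

def contiguous_partitions(n):
    """Yield all partitions of n SLO-sorted apps into contiguous groups:
    each of the n-1 gaps between adjacent apps is either a cut or not,
    giving 2**(n-1) partitions. Brute force is only viable for tiny n,
    which motivates the two-stage merging heuristic."""
    for cuts in product([False, True], repeat=n - 1):
        groups, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:
                groups.append(list(range(start, i)))
                start = i
        groups.append(list(range(start, n)))
        yield groups

parts = list(contiguous_partitions(4))   # 2**3 = 8 partitions
```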
4. Two-Stage Merging Heuristic
To address intractability, HarmonyBatch employs a two-stage merging heuristic on applications sorted by SLO:
- Stage 1 (CPU→GPU promotion): Traverse adjacent apps currently served on CPU, merging them until the combined arrival rate exceeds a platform-specific threshold beyond which GPU serving becomes cost-effective. Re-compute the merged group's provisioning, accepting only merges that reduce cost.
- Stage 2 (GPU coalescing): Traverse adjacent GPU-served groups, attempting pairwise merges if cost decreases.
Within each group, provisioning leverages two analytical results:
- CPU optimality: For a fixed batch size b, the cost as a function of the vCPU count c attains its minimum either at an endpoint of the feasible range or at a stationary point of the cost curve.
- GPU optimality: For a fixed GPU memory allocation m, the largest SLO-feasible batch size minimizes cost.
Both optimizations are executed via binary search, yielding a practical complexity linear in the number of applications N and logarithmic in the maximum GPU memory M and the maximum batch size B.
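The GPU-side binary search can be sketched as follows, assuming a latency predictor that is monotone increasing in batch size (the linear model and its coefficients are placeholders):

```python
def max_feasible_batch(slo_s, predict_latency, max_batch):
    """Binary search for the largest batch size whose predicted latency
    still meets the SLO; predict_latency is assumed monotone increasing
    in the batch size, so feasibility is a prefix property."""
    lo, hi, best = 1, max_batch, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if predict_latency(mid) <= slo_s:
            best = mid          # feasible: try larger batches
            lo = mid + 1
        else:
            hi = mid - 1        # infeasible: shrink the batch
    return best

# Toy linear latency model l(b) = 0.02*b + 0.05 (placeholder coefficients):
b_star = max_feasible_batch(0.5, lambda b: 0.02 * b + 0.05, 64)
```

Monotonicity is what makes binary search valid here: once a batch size violates the SLO, every larger one does too.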
5. Implementation and Engineering Considerations
HarmonyBatch is implemented in approximately 1.4K lines of Python, leveraging the Alibaba Cloud Function Compute platform. Notable design choices include:
- Model profiling: 100 CPU trials across the profiled batch sizes and vCPU allocations, and three GPU runs per batch size and memory configuration for latency/time-slice fitting.
- Batching policy: CPU batch sizes are capped at 4 (the regime in which CPU functions remain cost-effective); GPU functions are prioritized for larger batch executions.
- Online adaptation: HarmonyBatch re-executes the group merging and provisioning routines at a fixed interval to adapt to workload changes.
- Efficiency: Per-group provisioning leverages binary search as opposed to exhaustive enumeration, ensuring sub-millisecond runtime for optimization (~10× faster than MBS in Table 2).
6. Experimental Results
Empirical validation uses replayed Azure Functions traces on four DNNs—VideoMAE, VGG-19, BERT, and GPT-2—serving eight co-hosted applications with evenly spaced SLOs within the models’ feasible ranges. Key findings include:
- Model prediction accuracy: CPU (VideoMAE/VGG-19) average/max latency errors: 0.2–6.1%; GPU (BERT/GPT-2): 0.1–11.4%, which outperform prior regression-based methods due to explicit time-slice modeling.
- Cost reduction: HarmonyBatch delivers up to 82.9% lower cost than BATCH, and 16–60% lower cost than MBS (Figure 1).
- SLO compliance: Achieves zero SLO violations (Figure 2) across workloads.
- Provisioning speed: Heuristic runs in milliseconds for 12 applications, an order of magnitude faster than MBS.
- Merge efficacy: Progressive merging reduces cost significantly (e.g., up to 38% for VGG-19 after seven merges).
A summary of these aggregate outcomes across the four evaluated models is organized below:
| Models | Cost Reduction vs. BATCH | Cost Reduction vs. MBS | SLO Violations |
|---|---|---|---|
| VideoMAE, VGG-19, BERT, GPT-2 | Up to 82.9% | 16–60% | 0 |
7. Limitations and Future Work
HarmonyBatch assumes Poisson arrival processes for waiting time and batch size computation, which may not accurately capture bursty or heavy-tailed traffic patterns. Grouping currently operates at the single-model level without supporting cross-model batch fusion. Prototype results are limited to Alibaba Cloud; scheduling semantics (e.g., AWS Lambda GPU) may diverge.
Potential future extensions include incorporating non-Poisson arrival processes into batching timeouts, enabling partitioned or quantized large model inference across multiple functions, extending the heuristic to mixed-model heterogeneous operator batching, and integrating GPU spatial-sharing features where temporal sharing is insufficient.
In summary, HarmonyBatch integrates lightweight profiling, analytical latency and cost modeling, and scalable two-stage merging to deliver provable SLO compliance and significant operational cost reduction for serverless DNN inference workloads (Chen et al., 2024).