
HarmonyBatch: Serverless DNN Inference

Updated 21 January 2026
  • HarmonyBatch is a serverless-native framework for cost-efficient DNN inference using joint batching and dynamic resource provisioning to meet diverse SLOs.
  • It employs analytical performance and cost models to predict CPU/GPU latency and optimize resource allocation across multiple applications.
  • The system features a two-stage merging heuristic that ensures fast provisioning and significant cost reduction while maintaining strict SLO compliance.

HarmonyBatch is a serverless-native resource provisioning and batching framework designed to provide predictable, cost-efficient Deep Neural Network (DNN) inference for multi-application serverless platforms with heterogeneous compute functions (CPU and GPU). It addresses multi-SLO (Service Level Objective) inference scenarios, enabling joint batching of requests with distinct latency deadlines across diverse applications, and dynamically provisions resources to minimize operational cost while ensuring SLO compliance. HarmonyBatch is prototyped and evaluated on Alibaba Cloud Function Compute, featuring analytical performance and cost modeling and a heuristic-driven joint batching and function-selection algorithm for the underlying NP-hard provisioning problem (Chen et al., 2024).

1. Architecture and System Components

HarmonyBatch comprises four principal components responsible for profiling, batching, resource selection, and automated grouping:

  • Model Profiler & Performance Predictor: Onboarding a new DNN model triggers the profiler to conduct micro-benchmarks across various batch sizes and CPU/GPU resource slices, extracting coefficients to parameterize empirical latency models. The predictor then estimates average and tail latencies for subsequent batching and resource allocation.
  • Batch Manager: Each application maintains a FIFO queue, tagging each incoming inference request with its SLO deadline. Batches are formed either on expiration of the earliest deadline or when reaching target batch size and are then forwarded for execution.
  • Function Launcher (Resource Provisioner): This component uses the analytical models to select the optimal serverless function (CPU or GPU, with the required vCPUs or memory) and corresponding batch size to meet the strictest SLO in each batch while minimizing provisioning cost. The Alibaba FC-Open SDK is used for dynamic function invocation and scaling.
  • Two-Stage Group Merger: Periodically, or under workload dynamics, HarmonyBatch re-clusters applications sharing the same model into groups to optimize batching and provisioning plans in response to observed arrival rates and SLO distributions.

This decoupled architecture supports adaptable grouping and dispatching policies, enabling online adaptation to time-varying workloads typical in production serverless environments.
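As a concrete illustration of the Batch Manager described above, the sketch below implements a per-application FIFO queue that flushes a batch on either a size target or expiration of the oldest request's timeout. All names and the timeout-checking structure are illustrative, not taken from the HarmonyBatch codebase.

```python
from collections import deque

class BatchManager:
    """Per-application FIFO queue: flush on target size or earliest timeout."""

    def __init__(self, target_batch_size, timeout):
        self.target = target_batch_size
        self.timeout = timeout          # batching timeout derived from the SLO
        self.queue = deque()            # (arrival_time, payload), FIFO order

    def enqueue(self, payload, now):
        """Tag the request with its arrival time, then try to form a batch."""
        self.queue.append((now, payload))
        return self.maybe_flush(now)

    def maybe_flush(self, now):
        """Return a batch if the size target or the oldest timeout is hit."""
        if not self.queue:
            return None
        full = len(self.queue) >= self.target
        expired = now - self.queue[0][0] >= self.timeout
        if full or expired:
            batch = [payload for _, payload in self.queue]
            self.queue.clear()
            return batch
        return None
```

In a real deployment the flush would hand the batch to the Function Launcher; here it simply returns it to the caller.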

2. Analytical Performance and Cost Models

HarmonyBatch introduces closed-form, empirically-grounded models for CPU and GPU inference latency and monetized operational cost:

CPU Latency: For a batch of size $b$ on a $c$-vCPU function, average and maximum inference latencies are modeled as exponential decay functions of $c$:

  • $L_{\mathrm{avg}}^{c}(b, c) = \alpha_b^{\mathrm{avg}} \exp(-c/\beta_b^{\mathrm{avg}}) + \gamma_b^{\mathrm{avg}}$
  • $L_{\mathrm{max}}^{c}(b, c) = \alpha_b^{\mathrm{max}} \exp(-c/\beta_b^{\mathrm{max}}) + \gamma_b^{\mathrm{max}}$

where $\alpha_b, \beta_b, \gamma_b$ are batch-size-specific coefficients extracted during profiling.
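The CPU latency model above can be evaluated directly; the coefficients below are made-up placeholders for a single batch size, not profiled values:

```python
import math

def cpu_latency(c, alpha, beta, gamma):
    """Exponential-decay model: latency falls toward the floor gamma as vCPUs grow."""
    return alpha * math.exp(-c / beta) + gamma

# Placeholder coefficients for one batch size b (illustrative only).
alpha, beta, gamma = 0.8, 1.2, 0.15

lat_small = cpu_latency(1.0, alpha, beta, gamma)  # fewer vCPUs: higher latency
lat_large = cpu_latency(3.0, alpha, beta, gamma)  # more vCPUs: closer to gamma
```

Adding vCPUs yields diminishing returns: latency never drops below the floor $\gamma_b$, which is why the cost optimum lies at an endpoint or stationary point rather than at maximal $c$.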

GPU Latency with Time-Slicing: Under full-device allocation, latency scales linearly with batch size:

  • $L_0^g(b) = \xi_1 b + \xi_2$

In Alibaba's cGPU mode with temporal sharing, a function allocated $m$ GB of the device's $M_{\max}$ GB of memory receives time slices of length $m\tau$, yielding:

  • $L_{\mathrm{avg}}^g(b, m) = \frac{M_{\max}}{m} L_0^g(b)$
  • $L_{\mathrm{max}}^g(b, m) = \left\lceil \frac{L_0^g(b)}{m\tau} \right\rceil (M_{\max} - m)\tau + L_0^g(b)$
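A minimal sketch of the full-device and worst-case sharing formulas, assuming $\tau$ is the per-GB time-slice unit (so a function computes for $m\tau$ per round and may wait $(M_{\max}-m)\tau$); the coefficients are illustrative, not fitted values:

```python
import math

def gpu_solo_latency(b, xi1, xi2):
    """Full-device GPU latency, linear in batch size: L0(b) = xi1*b + xi2."""
    return xi1 * b + xi2

def gpu_max_latency(b, m, M_max, tau, xi1, xi2):
    """Worst-case latency under cGPU temporal sharing with m of M_max GB allocated."""
    L0 = gpu_solo_latency(b, xi1, xi2)
    # Each round the function computes for m*tau and may wait (M_max - m)*tau.
    rounds = math.ceil(L0 / (m * tau))
    return rounds * (M_max - m) * tau + L0
```

With a full allocation ($m = M_{\max}$) the waiting term vanishes and the model collapses to $L_0^g(b)$, as the formula requires.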

Poisson Arrivals and Timeout: Request arrivals for each application are assumed to be independent Poisson processes with rate $r_i$. Batched group timeouts are computed recursively; for two applications with rates $r_1, r_2$ and timeouts $T_1 \leq T_2$:

  • $T^X = T_1 + \frac{r_2}{r_1 + r_2} \cdot \frac{1 - \exp(-r_1 (T_2 - T_1))}{r_1}$

The expected batch size for group $X$ is then $\mathbb{E}[N] \approx r^X T^X$, where $r^X$ is the group's aggregate arrival rate.
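The two-application timeout formula and the resulting expected batch size are a direct transcription, assuming $T_1 \leq T_2$:

```python
import math

def group_timeout(r1, T1, r2, T2):
    """Effective timeout T^X for a two-app group with Poisson rates r1, r2
    and per-app timeouts T1 <= T2."""
    return T1 + (r2 / (r1 + r2)) * (1.0 - math.exp(-r1 * (T2 - T1))) / r1

def expected_batch_size(r_group, T_group):
    """E[N] ~ r^X * T^X: requests accumulating during one timeout window."""
    return r_group * T_group
```

Note that when $T_2 = T_1$ the correction term vanishes and the group timeout reduces to the shared timeout, as expected.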

Cost Model: For group $X$ with batch size $b^X$, function type $t \in \{c, g\}$, resource allocation $(c, m)$, and average latency $L_{\mathrm{avg}}^t$, the cost per request is

  • $C^X = \frac{1}{b^X}\left[L_{\mathrm{avg}}^t (c K_1 + m K_2) + K_3\right]$

where $K_1$ (CPU-time), $K_2$ (GPU-memory), and $K_3$ (per-invocation overhead) are platform-specific pricing terms.
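A direct transcription of the cost expression. The pricing constants below are illustrative, not Alibaba's actual rates, and setting $m = 0$ for a CPU function (or $c = 0$ for a GPU function) is an assumption of this sketch:

```python
def cost_per_request(b, L_avg, c, m, K1, K2, K3):
    """Per-request cost: amortize resource-time charges plus the fixed
    per-invocation fee K3 over the b requests in the batch."""
    return (L_avg * (c * K1 + m * K2) + K3) / b

# Illustrative pricing constants (placeholders, not real prices).
K1, K2, K3 = 0.000127, 0.00011, 0.009

# A 2-vCPU CPU function (m = 0) serving a batch of 4 with 0.3 s average latency.
cpu_cost = cost_per_request(b=4, L_avg=0.3, c=2.0, m=0.0, K1=K1, K2=K2, K3=K3)
```

The $1/b^X$ factor is what makes batching pay off: the invocation fee $K_3$ and the compute-time charge are shared across the whole batch.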

3. Grouping and Resource Optimization (NP-Hard Formulation)

HarmonyBatch considers $n$ applications $\{w_i\}$ sharing a model, each with arrival rate $r^{w_i}$ and SLO $s^{w_i}$. The central optimization is to partition the applications into $M$ groups $\mathcal{G} = \{X_1, \ldots, X_M\}$ and, for each group, select:

  • Function type $t^X$
  • Resource allocation $(c^X, m^X)$
  • Batch size $b^X$
  • Timeouts $T_i$

such that the workload-wide expected cost

$$\min_{\mathcal{G},\, \{t^X, c^X, m^X, b^X\}} \sum_{X \in \mathcal{G}} \frac{r^X}{r^{\mathrm{tot}}} C^X$$

is minimized subject to:

  • SLO constraint: $L_{\max}^{t^X}(b^X, c^X, m^X) + T_w \leq s^w$ for all $w \in X$
  • Batch feasibility: $b^X \leq \lfloor r^X T^X \rfloor + 1$
  • GPU-memory limit: $m^X \geq M_{\mathrm{req}}(b^X)$

The resulting non-linear integer program is NP-hard.
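To see why a heuristic is needed, note that an exact solver must search over all set partitions of the applications, whose count grows as the Bell numbers. A tiny brute-force sketch, with a caller-supplied stand-in for the per-group cost model:

```python
def partitions(items):
    """Enumerate all set partitions (Bell-number growth, so only tiny n)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` in its own group...
        yield [[first]] + smaller
        # ...or add it to each existing group in turn.
        for i, group in enumerate(smaller):
            yield smaller[:i] + [group + [first]] + smaller[i + 1:]

def brute_force_grouping(apps, group_cost):
    """Return the partition minimizing total cost under a supplied cost model."""
    return min(partitions(apps),
               key=lambda p: sum(group_cost(g) for g in p))
```

Even at $n = 12$ (the paper's largest evaluated workload) there are over four million partitions before resource and batch choices are even considered, which motivates the polynomial-time heuristic below.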

4. Two-Stage Merging Heuristic

To address intractability, HarmonyBatch employs a two-stage merging heuristic on applications sorted by SLO:

  • Stage 1 (CPU→GPU pushes): Traverse adjacent apps currently on CPU, merging them until the combined rate $r^X$ exceeds a platform-specific threshold $r^*$. Re-compute group provisioning, accepting merges that reduce cost.
  • Stage 2 (GPU coalescing): Traverse adjacent GPU-served groups, attempting pairwise merges if cost decreases.

Within each group, provisioning leverages two analytical results:

  • CPU optimality: For fixed batch size $b$, the cost as a function of vCPU count $c$ attains its minimum at an endpoint or stationary point of $C^X(c)$.
  • GPU optimality: For fixed GPU memory $m$, the largest feasible batch size $b = \lfloor r^X T^X \rfloor + 1$ minimizes cost.

Both optimizations are executed via binary search, yielding practical complexity $O(n M_{\max} \log B_{\max})$, where $n$ is the number of applications, $M_{\max}$ is the maximum GPU memory, and $B_{\max}$ is the maximum batch size.
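The two-stage structure can be sketched as follows. Here `provision_cost` and `rate` are caller-supplied stand-ins for the analytical models, and the merge-acceptance conditions are simplified relative to the paper's algorithm:

```python
def two_stage_merge(apps, rate, provision_cost, r_star):
    """Greedy two-stage merging over apps sorted by SLO (illustrative skeleton)."""
    apps = sorted(apps, key=lambda a: a["slo"])
    groups = [[a] for a in apps]

    # Stage 1: merge adjacent groups while the running group's rate is below
    # the CPU->GPU threshold r_star and the merge does not increase cost.
    merged = [groups[0]]
    for g in groups[1:]:
        cand = merged[-1] + g
        below = sum(rate(a) for a in merged[-1]) < r_star
        cheaper = provision_cost(cand) <= (provision_cost(merged[-1])
                                           + provision_cost(g))
        if below and cheaper:
            merged[-1] = cand
        else:
            merged.append(g)

    # Stage 2: coalesce adjacent groups whenever a pairwise merge lowers cost.
    changed = True
    while changed:
        changed = False
        for i in range(len(merged) - 1):
            cand = merged[i] + merged[i + 1]
            if provision_cost(cand) < (provision_cost(merged[i])
                                       + provision_cost(merged[i + 1])):
                merged[i:i + 2] = [cand]
                changed = True
                break
    return merged
```

With a cost model exhibiting economies of scale, the skeleton collapses everything into one group; with per-group diseconomies, every application stays separate, mirroring how the real heuristic only accepts cost-reducing merges.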

5. Implementation and Engineering Considerations

HarmonyBatch is implemented in approximately 1.4K lines of Python, leveraging the Alibaba Cloud Function Compute platform. Notable design choices include:

  • Model profiling: 100 CPU trials for batch sizes $b \in \{1, \ldots, 4\}$ and vCPU allocations $c \in \{0.5, \ldots, 3.0\}$; three GPU runs per batch size and configuration for latency/time-slice fitting.
  • Batching policy: CPU batch sizes are capped at 4 (where CPUs are cost effective); GPU is prioritized for larger batch executions.
  • Online adaptation: HarmonyBatch re-executes the group merging and provisioning routines every $T$ seconds to adapt to the workload.
  • Efficiency: Per-group provisioning uses binary search instead of exhaustive enumeration, keeping optimization runtime sub-millisecond (~10× faster than MBS$^+$ in Table 2).

6. Experimental Results

Empirical validation uses replayed Azure Functions traces on four DNNs—VideoMAE, VGG-19, BERT, and GPT-2—serving eight co-hosted applications with evenly spaced SLOs within the models’ feasible ranges. Key findings include:

  • Model prediction accuracy: CPU (VideoMAE/VGG-19) average/max latency prediction errors of 0.2–6.1%; GPU (BERT/GPT-2) errors of 0.1–11.4%, outperforming prior regression-based methods thanks to explicit time-slice modeling.
  • Cost reduction: HarmonyBatch delivers up to 82.9% lower cost than BATCH and 16–60% lower cost than MBS$^+$ (Figure 1).
  • SLO compliance: Achieves zero SLO violations (Figure 2) across workloads.
  • Provisioning speed: The heuristic runs in milliseconds for 12 applications, an order of magnitude faster than MBS$^+$.
  • Merge efficacy: Progressive merging reduces cost significantly (e.g., up to 38% for VGG-19 after seven merges).

A summary of these quantitative outcomes is organized below:

| Model    | Cost Reduction vs. BATCH | Cost Reduction vs. MBS$^+$ | SLO Violations |
|----------|--------------------------|----------------------------|----------------|
| VideoMAE | Up to 82.9%              | 16–60%                     | 0              |
| VGG-19   | Up to 82.9%              | 16–60%                     | 0              |
| BERT     | Up to 82.9%              | 16–60%                     | 0              |
| GPT-2    | Up to 82.9%              | 16–60%                     | 0              |

7. Limitations and Future Work

HarmonyBatch assumes Poisson arrival processes for waiting time and batch size computation, which may not accurately capture bursty or heavy-tailed traffic patterns. Grouping currently operates at the single-model level without supporting cross-model batch fusion. Prototype results are limited to Alibaba Cloud; scheduling semantics (e.g., AWS Lambda GPU) may diverge.

Potential future extensions include incorporating non-Poisson arrival processes into batching timeouts, enabling partitioned or quantized large model inference across multiple functions, extending the heuristic to mixed-model heterogeneous operator batching, and integrating GPU spatial-sharing features where temporal sharing is insufficient.

In summary, HarmonyBatch integrates lightweight profiling, analytical latency and cost modeling, and scalable two-stage merging to deliver strict SLO compliance and significant operational cost reduction for serverless DNN inference workloads (Chen et al., 2024).
