
HarmonyBatch: Serverless DNN Inference

Updated 21 January 2026
  • HarmonyBatch is a serverless-native framework for cost-efficient DNN inference using joint batching and dynamic resource provisioning to meet diverse SLOs.
  • It employs analytical performance and cost models to predict CPU/GPU latency and optimize resource allocation across multiple applications.
  • The system features a two-stage merging heuristic that ensures fast provisioning and significant cost reduction while maintaining strict SLO compliance.

HarmonyBatch is a serverless-native resource provisioning and batching framework designed to provide predictable, cost-efficient Deep Neural Network (DNN) inference for multi-application serverless platforms with heterogeneous compute functions (CPU and GPU). It addresses multi-SLO (Service Level Objective) inference scenarios, enabling joint batching of requests with distinct latency deadlines across diverse applications, and dynamically provisions resources to minimize operational cost while ensuring SLO compliance. HarmonyBatch is prototyped and evaluated on Alibaba Cloud Function Compute, featuring analytical performance and cost modeling and a heuristic-driven joint batching and function-selection algorithm for the underlying NP-hard provisioning problem (Chen et al., 2024).

1. Architecture and System Components

HarmonyBatch comprises four principal components responsible for profiling, batching, resource selection, and automated grouping:

  • Model Profiler & Performance Predictor: Onboarding a new DNN model triggers the profiler to conduct micro-benchmarks across various batch sizes and CPU/GPU resource slices, extracting coefficients to parameterize empirical latency models. The predictor then estimates average and tail latencies for subsequent batching and resource allocation.
  • Batch Manager: Each application maintains a FIFO queue, tagging each incoming inference request with its SLO deadline. Batches are formed either on expiration of the earliest deadline or when reaching target batch size and are then forwarded for execution.
  • Function Launcher (Resource Provisioner): This component uses the analytical models to select the optimal serverless function (CPU or GPU, with the required vCPUs or memory) and corresponding batch size to meet the strictest SLO in each batch while minimizing provisioning cost. The Alibaba FC-Open SDK is used for dynamic function invocation and scaling.
  • Two-Stage Group Merger: Periodically, or under workload dynamics, HarmonyBatch re-clusters applications sharing the same model into groups to optimize batching and provisioning plans in response to observed arrival rates and SLO distributions.

This decoupled architecture supports adaptable grouping and dispatching policies, enabling online adaptation to time-varying workloads typical in production serverless environments.
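As a concrete illustration of the Batch Manager described above, the sketch below implements a per-application FIFO queue that flushes a batch on either a size target or expiration of the oldest request's timeout. All names and the timeout-checking structure are illustrative, not taken from the HarmonyBatch codebase.

```python
from collections import deque

class BatchManager:
    """Per-application FIFO queue: flush on target size or earliest timeout."""

    def __init__(self, target_batch_size, timeout):
        self.target = target_batch_size
        self.timeout = timeout          # batching timeout derived from the SLO
        self.queue = deque()            # (arrival_time, payload), FIFO order

    def enqueue(self, payload, now):
        """Tag the request with its arrival time, then try to form a batch."""
        self.queue.append((now, payload))
        return self.maybe_flush(now)

    def maybe_flush(self, now):
        """Return a batch if the size target or the oldest timeout is hit."""
        if not self.queue:
            return None
        full = len(self.queue) >= self.target
        expired = now - self.queue[0][0] >= self.timeout
        if full or expired:
            batch = [payload for _, payload in self.queue]
            self.queue.clear()
            return batch
        return None
```

In a real deployment the flush would hand the batch to the Function Launcher; here it simply returns it to the caller.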

2. Analytical Performance and Cost Models

HarmonyBatch introduces closed-form, empirically-grounded models for CPU and GPU inference latency and monetized operational cost:

CPU Latency: For a batch of size $b$ on a $c$-vCPU function, average and maximum inference latencies are modeled as exponential decay functions of $c$:

  • $L_{\mathrm{avg}}^{c}(b, c) = \alpha_b^{\mathrm{avg}} \exp(-c/\beta_b^{\mathrm{avg}}) + \gamma_b^{\mathrm{avg}}$
  • $L_{\mathrm{max}}^{c}(b, c) = \alpha_b^{\mathrm{max}} \exp(-c/\beta_b^{\mathrm{max}}) + \gamma_b^{\mathrm{max}}$

where $\alpha_b, \beta_b, \gamma_b$ are batch-size-specific coefficients extracted during profiling.
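The CPU latency model above can be evaluated directly; the coefficients below are made-up placeholders for a single batch size, not profiled values:

```python
import math

def cpu_latency(c, alpha, beta, gamma):
    """Exponential-decay model: latency falls toward the floor gamma as vCPUs grow."""
    return alpha * math.exp(-c / beta) + gamma

# Placeholder coefficients for one batch size b (illustrative only).
alpha, beta, gamma = 0.8, 1.2, 0.15

lat_small = cpu_latency(1.0, alpha, beta, gamma)  # fewer vCPUs: higher latency
lat_large = cpu_latency(3.0, alpha, beta, gamma)  # more vCPUs: closer to gamma
```

Adding vCPUs yields diminishing returns: latency never drops below the floor $\gamma_b$, which is why the cost optimum lies at an endpoint or stationary point rather than at maximal $c$.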

GPU Latency with Time-Slicing: Under full-device allocation, latency scales linearly with batch size:

  • $L_0^g(b) = \xi_1 b + \xi_2$

In Alibaba's cGPU mode with temporal sharing, a function allocated $m$ GB of the device's $M_{\max}$ GB of memory receives time slices of length $m\tau$, yielding:

  • $L_{\mathrm{avg}}^g(b, m) = \frac{M_{\max}}{m} L_0^g(b)$
  • $L_{\mathrm{max}}^g(b, m) = \left\lceil \frac{L_0^g(b)}{m\tau} \right\rceil (M_{\max} - m)\tau + L_0^g(b)$
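A minimal sketch of the full-device and worst-case sharing formulas, assuming $\tau$ is the per-GB time-slice unit (so a function computes for $m\tau$ per round and may wait $(M_{\max}-m)\tau$); the coefficients are illustrative, not fitted values:

```python
import math

def gpu_solo_latency(b, xi1, xi2):
    """Full-device GPU latency, linear in batch size: L0(b) = xi1*b + xi2."""
    return xi1 * b + xi2

def gpu_max_latency(b, m, M_max, tau, xi1, xi2):
    """Worst-case latency under cGPU temporal sharing with m of M_max GB allocated."""
    L0 = gpu_solo_latency(b, xi1, xi2)
    # Each round the function computes for m*tau and may wait (M_max - m)*tau.
    rounds = math.ceil(L0 / (m * tau))
    return rounds * (M_max - m) * tau + L0
```

With a full allocation ($m = M_{\max}$) the waiting term vanishes and the model collapses to $L_0^g(b)$, as the formula requires.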

Poisson Arrivals and Timeout: Request arrivals for each application are assumed to be independent Poisson processes with rate $r_i$. Batched group timeouts are computed recursively; for two applications with rates $r_1, r_2$ and timeouts $T_1 \leq T_2$:

  • $T^X = T_1 + \frac{r_2}{r_1 + r_2} \cdot \frac{1 - \exp(-r_1 (T_2 - T_1))}{r_1}$

The expected batch size for group $X$ is then $\mathbb{E}[N] \approx r^X T^X$, where $r^X$ is the group's aggregate arrival rate.
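The two-application timeout formula and the resulting expected batch size are a direct transcription, assuming $T_1 \leq T_2$:

```python
import math

def group_timeout(r1, T1, r2, T2):
    """Effective timeout T^X for a two-app group with Poisson rates r1, r2
    and per-app timeouts T1 <= T2."""
    return T1 + (r2 / (r1 + r2)) * (1.0 - math.exp(-r1 * (T2 - T1))) / r1

def expected_batch_size(r_group, T_group):
    """E[N] ~ r^X * T^X: requests accumulating during one timeout window."""
    return r_group * T_group
```

Note that when $T_2 = T_1$ the correction term vanishes and the group timeout reduces to the shared timeout, as expected.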

Cost Model: For group $X$ with batch size $b^X$, function type $t \in \{c, g\}$, resource allocation $(c, m)$, and average latency $L_{\mathrm{avg}}^t$, the cost per request is

  • $C^X = \frac{1}{b^X}\left[L_{\mathrm{avg}}^t (c K_1 + m K_2) + K_3\right]$

where $K_1$ (CPU-time), $K_2$ (GPU-memory), and $K_3$ (per-invocation overhead) are platform-specific pricing terms.
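A direct transcription of the cost expression. The pricing constants below are illustrative, not Alibaba's actual rates, and setting $m = 0$ for a CPU function (or $c = 0$ for a GPU function) is an assumption of this sketch:

```python
def cost_per_request(b, L_avg, c, m, K1, K2, K3):
    """Per-request cost: amortize resource-time charges plus the fixed
    per-invocation fee K3 over the b requests in the batch."""
    return (L_avg * (c * K1 + m * K2) + K3) / b

# Illustrative pricing constants (placeholders, not real prices).
K1, K2, K3 = 0.000127, 0.00011, 0.009

# A 2-vCPU CPU function (m = 0) serving a batch of 4 with 0.3 s average latency.
cpu_cost = cost_per_request(b=4, L_avg=0.3, c=2.0, m=0.0, K1=K1, K2=K2, K3=K3)
```

The $1/b^X$ factor is what makes batching pay off: the invocation fee $K_3$ and the compute-time charge are shared across the whole batch.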

3. Grouping and Resource Optimization (NP-Hard Formulation)

HarmonyBatch considers $n$ applications $\{w_i\}$ sharing a model, each with arrival rate $r^{w_i}$ and SLO $s^{w_i}$. The central optimization is to partition the applications into $M$ groups $\mathcal{G} = \{X_1, \ldots, X_M\}$ and, for each group, select:

  • Function type $t^X$
  • Resource allocation $(c^X, m^X)$
  • Batch size $b^X$
  • Timeouts $T_i$

such that the workload-wide expected cost

$$\min_{\mathcal{G},\, \{t^X, c^X, m^X, b^X\}} \sum_{X \in \mathcal{G}} \frac{r^X}{r^{\mathrm{tot}}} C^X$$

is minimized subject to:

  • SLO constraint: $L_{\max}^{t^X}(b^X, c^X, m^X) + T_w \leq s^w$ for all $w \in X$
  • Batch feasibility: $b^X \leq \lfloor r^X T^X \rfloor + 1$
  • GPU-memory limit: $m^X \geq M_{\mathrm{req}}(b^X)$

The resulting non-linear integer program is NP-hard.
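To see why a heuristic is needed, note that an exact solver must search over all set partitions of the applications, whose count grows as the Bell numbers. A tiny brute-force sketch, with a caller-supplied stand-in for the per-group cost model:

```python
def partitions(items):
    """Enumerate all set partitions (Bell-number growth, so only tiny n)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` in its own group...
        yield [[first]] + smaller
        # ...or add it to each existing group in turn.
        for i, group in enumerate(smaller):
            yield smaller[:i] + [group + [first]] + smaller[i + 1:]

def brute_force_grouping(apps, group_cost):
    """Return the partition minimizing total cost under a supplied cost model."""
    return min(partitions(apps),
               key=lambda p: sum(group_cost(g) for g in p))
```

Even at $n = 12$ (the paper's largest evaluated workload) there are over four million partitions before resource and batch choices are even considered, which motivates the polynomial-time heuristic below.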

4. Two-Stage Merging Heuristic

To address intractability, HarmonyBatch employs a two-stage merging heuristic on applications sorted by SLO:

  • Stage 1 (CPU→GPU pushes): Traverse adjacent apps currently on CPU, merging them until the combined rate $r^X$ exceeds a platform-specific threshold $r^*$. Re-compute group provisioning, accepting merges that reduce cost.
  • Stage 2 (GPU coalescing): Traverse adjacent GPU-served groups, attempting pairwise merges if cost decreases.

Within each group, provisioning leverages two analytical results:

  • CPU optimality: For fixed batch size $b$, the cost as a function of vCPU count $c$ attains its minimum at an endpoint or stationary point of $C^X(c)$.
  • GPU optimality: For fixed GPU memory $m$, the largest feasible batch size $b = \lfloor r^X T^X \rfloor + 1$ minimizes cost.

Both optimizations are executed via binary search, yielding practical complexity $O(n M_{\max} \log B_{\max})$, where $n$ is the number of applications, $M_{\max}$ is the maximum GPU memory, and $B_{\max}$ is the maximum batch size.
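The two-stage structure can be sketched as follows. Here `provision_cost` and `rate` are caller-supplied stand-ins for the analytical models, and the merge-acceptance conditions are simplified relative to the paper's algorithm:

```python
def two_stage_merge(apps, rate, provision_cost, r_star):
    """Greedy two-stage merging over apps sorted by SLO (illustrative skeleton)."""
    apps = sorted(apps, key=lambda a: a["slo"])
    groups = [[a] for a in apps]

    # Stage 1: merge adjacent groups while the running group's rate is below
    # the CPU->GPU threshold r_star and the merge does not increase cost.
    merged = [groups[0]]
    for g in groups[1:]:
        cand = merged[-1] + g
        below = sum(rate(a) for a in merged[-1]) < r_star
        cheaper = provision_cost(cand) <= (provision_cost(merged[-1])
                                           + provision_cost(g))
        if below and cheaper:
            merged[-1] = cand
        else:
            merged.append(g)

    # Stage 2: coalesce adjacent groups whenever a pairwise merge lowers cost.
    changed = True
    while changed:
        changed = False
        for i in range(len(merged) - 1):
            cand = merged[i] + merged[i + 1]
            if provision_cost(cand) < (provision_cost(merged[i])
                                       + provision_cost(merged[i + 1])):
                merged[i:i + 2] = [cand]
                changed = True
                break
    return merged
```

With a cost model exhibiting economies of scale, the skeleton collapses everything into one group; with per-group diseconomies, every application stays separate, mirroring how the real heuristic only accepts cost-reducing merges.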

5. Implementation and Engineering Considerations

HarmonyBatch is implemented in approximately 1.4K lines of Python, leveraging the Alibaba Cloud Function Compute platform. Notable design choices include:

  • Model profiling: 100 CPU trials for batch sizes $b \in \{1, \ldots, 4\}$ and vCPU allocations $c \in \{0.5, \ldots, 3.0\}$; three GPU runs per batch size and configuration for latency/time-slice fitting.
  • Batching policy: CPU batch sizes are capped at 4 (where CPUs are cost effective); GPU is prioritized for larger batch executions.
  • Online adaptation: HarmonyBatch re-executes the group merging and provisioning routines every $T$ seconds to adapt to the workload.
  • Efficiency: Per-group provisioning uses binary search instead of exhaustive enumeration, keeping optimization runtime sub-millisecond (~10× faster than MBS$^+$ in Table 2).

6. Experimental Results

Empirical validation uses replayed Azure Functions traces on four DNNs—VideoMAE, VGG-19, BERT, and GPT-2—serving eight co-hosted applications with evenly spaced SLOs within the models’ feasible ranges. Key findings include:

  • Model prediction accuracy: CPU (VideoMAE/VGG-19) average/max latency prediction errors of 0.2–6.1%; GPU (BERT/GPT-2) errors of 0.1–11.4%, outperforming prior regression-based methods thanks to explicit time-slice modeling.
  • Cost reduction: HarmonyBatch delivers up to 82.9% lower cost than BATCH and 16–60% lower cost than MBS$^+$ (Figure 1).
  • SLO compliance: Achieves zero SLO violations (Figure 2) across workloads.
  • Provisioning speed: The heuristic runs in milliseconds for 12 applications, an order of magnitude faster than MBS$^+$.
  • Merge efficacy: Progressive merging reduces cost significantly (e.g., up to 38% for VGG-19 after seven merges).

A summary of these quantitative outcomes is organized below:

| Model    | Cost Reduction vs. BATCH | Cost Reduction vs. MBS$^+$ | SLO Violations |
|----------|--------------------------|----------------------------|----------------|
| VideoMAE | Up to 82.9%              | 16–60%                     | 0              |
| VGG-19   | Up to 82.9%              | 16–60%                     | 0              |
| BERT     | Up to 82.9%              | 16–60%                     | 0              |
| GPT-2    | Up to 82.9%              | 16–60%                     | 0              |

7. Limitations and Future Work

HarmonyBatch assumes Poisson arrival processes for waiting time and batch size computation, which may not accurately capture bursty or heavy-tailed traffic patterns. Grouping currently operates at the single-model level without supporting cross-model batch fusion. Prototype results are limited to Alibaba Cloud; scheduling semantics (e.g., AWS Lambda GPU) may diverge.

Potential future extensions include incorporating non-Poisson arrival processes into batching timeouts, enabling partitioned or quantized large model inference across multiple functions, extending the heuristic to mixed-model heterogeneous operator batching, and integrating GPU spatial-sharing features where temporal sharing is insufficient.

In summary, HarmonyBatch integrates lightweight profiling, analytical latency and cost modeling, and scalable two-stage merging to deliver strict SLO compliance and significant operational cost reduction for serverless DNN inference workloads (Chen et al., 2024).
