
BucketServe: Dynamic Batching for LLM Inference

Updated 21 March 2026
  • BucketServe is a bucket-based dynamic batching framework that adaptively groups LLM inference requests by sequence length to reduce padding waste and meet latency SLOs.
  • It features a multi-component architecture—including a Request Bucketing Manager, Dynamic Batching Controller, and priority-aware scheduler—to optimize GPU memory usage.
  • Empirical evaluations demonstrate that BucketServe significantly increases throughput and GPU utilization while maintaining SLO compliance under diverse, heterogeneous workloads.

BucketServe is a bucket-based dynamic batching framework engineered to optimize inference performance for LLMs under heterogeneous workloads. Unlike traditional LLM serving systems that rely on static or continuous batching—often resulting in inefficient GPU memory utilization and increased latency—BucketServe adaptively groups and schedules requests by sequence length, dynamically adjusts batch sizes to hardware constraints, and integrates priority-aware scheduling to satisfy service level objectives (SLOs) (Zheng et al., 23 Jul 2025). Its design addresses the fundamental tension between maximizing throughput and maintaining strict latency requirements in real-time LLM applications.

1. System Architecture and Component Workflow

BucketServe comprises five primary components: Gateway, Request Bucketing Manager, Dynamic Batching Controller, P/D Scheduler, and Global Monitor. The typical request processing pipeline involves the following stages:

  • Gateway: Receives user requests and annotates them with metadata including sequence length, task type, and priority.
  • Request Bucketing Manager: Maintains a set $\mathcal{B} = \{b_1,\ldots,b_K\}$ of buckets, each associated with an interval $[L_b,U_b)$. An incoming request is assigned to the unique bucket whose interval contains its sequence length. Buckets are dynamically split or merged as workload fluctuates.
  • Dynamic Batching Controller: Periodically (or when a queue reaches a threshold), for each bucket $b$, it computes the safe GPU memory $M_\text{safe}=0.9 M_\text{remain}$. It determines $N_\text{max} = \max\{N~|~\sum_{i=1}^N \text{Memory}_\text{KV}(i)\leq M_\text{safe}\}$, where $\text{Memory}_\text{KV}(i) = 2\cdot L\cdot H\cdot D\cdot S_\text{max}\cdot B$, and selects up to $N_\text{max}$ requests for batching and padded submission.
  • P/D Scheduler: Handles prefill (building key-value (KV) caches on a first-come-first-served (FCFS) basis), orchestrates KV-cache transfer via NVLink, and manages decoding (using continuous batching per Orca-style strategies).
  • Global Monitor: Tracks GPU and system metrics, feeding back into the Bucketing Manager and Batching Controller for online adjustment.

Pipeline flow, as per the architecture, is:

User → Gateway → Bucketing Manager → Buckets b₁,…,b_K
  │
  └─> Dynamic Batching Controller ──> Prefill Queue ──> Prefill Workers
                                                              ↓ (NVLink)
                                   Decoding Queue ─ Decoding Workers → User
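
The bucket lookup performed by the Request Bucketing Manager can be sketched as a binary search over the upper bounds. This is a minimal illustration, not the authors' implementation: the `Bucket` class, the `make_buckets` and `assign` helpers, and the boundary values are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Bucket:
    lower: int  # inclusive bound L_b
    upper: int  # exclusive bound U_b
    requests: list = field(default_factory=list)


def make_buckets(boundaries):
    """Build contiguous buckets [b0, b1), [b1, b2), ... from sorted boundaries."""
    return [Bucket(lo, hi) for lo, hi in zip(boundaries, boundaries[1:])]


def assign(buckets, seq_len):
    """Place seq_len in the unique bucket whose interval [lower, upper) contains it."""
    uppers = [b.upper for b in buckets]
    idx = bisect_right(uppers, seq_len)  # index of first bucket with upper > seq_len
    if idx == len(buckets) or seq_len < buckets[0].lower:
        raise ValueError(f"sequence length {seq_len} is outside all buckets")
    buckets[idx].requests.append(seq_len)
    return buckets[idx]


# Illustrative boundaries (an assumption, not values from the paper).
buckets = make_buckets([0, 128, 512, 2048, 4096])
for length in (17, 128, 900, 4095):
    assign(buckets, length)
```

Because the intervals are half-open, a request of length 128 lands in `[128, 512)` rather than `[0, 128)`.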

2. Bucket Formation, Waste Minimization, and Dynamic Batching

Bucket formation is realized by partitioning the incoming request stream according to sequence length into $K$ intervals $[L_b,U_b)$. Each bucket contains requests of approximately similar length, which minimizes input sequence padding and associated computational waste.

  • Padding Overhead for a batch of $N$ requests with lengths $\{S_i\}$ is quantified as:

$$\text{Waste}_\text{Ratio} = \frac{S_\text{max}-S_\text{avg}}{S_\text{max}} \quad \text{(Eq.~2)}$$

where $S_\text{max} = \max_i S_i$, $S_\text{avg} = (1/N)\sum_i S_i$.

  • Expected Waste is the aggregate padding overhead across all buckets:

$$\mathbb{E}[\text{Waste}] = \sum_{b=1}^K \int_{L_b}^{U_b} \left(1-\frac{S}{U_b}\right) f(S)\, dS \quad \text{(Eq.~3)}$$

where $f(S)$ is the PDF of incoming sequence lengths.

  • Optimal Bucket Boundary to minimize expected waste is specified as:

$$U_b^* = \frac{\int_{L_b}^{U_b} S f(S)\, dS}{\int_{L_b}^{U_b} f(S)\, dS} \quad \text{(Eq.~4)}$$

Practically, bucket boundaries are approximated via midpoint bisection.
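
To make the effect of bucketing on padding concrete, the sketch below evaluates Eq. 2 for a single mixed batch and then for the same requests padded per bucket. The function names and the sample lengths are illustrative assumptions. Note that `bucketed_waste` pads each group to its own longest request (as in Eq. 2), whereas Eq. 3 pads to the bucket's upper bound $U_b$.

```python
def waste_ratio(lengths):
    """Eq. 2: fraction of a padded batch that consists of padding tokens."""
    s_max = max(lengths)
    s_avg = sum(lengths) / len(lengths)
    return (s_max - s_avg) / s_max


def bucketed_waste(lengths, boundaries):
    """Token-weighted padding fraction when each bucket is padded separately."""
    padded = useful = 0
    for lo, hi in zip(boundaries, boundaries[1:]):
        group = [s for s in lengths if lo <= s < hi]
        if group:
            padded += max(group) * len(group)  # every request padded to group max
            useful += sum(group)
    return (padded - useful) / padded


lengths = [10, 20, 30, 1000]                          # hypothetical mixed workload
single = waste_ratio(lengths)                         # one batch, padded to 1000
split = bucketed_waste(lengths, [0, 64, 2048])        # two buckets
```

For these sample lengths, one long request forces the single batch to be mostly padding (ratio 0.735), while two buckets bring the padded fraction down to about 0.03.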

Dynamic Batching leverages real-time GPU memory measurements. On each batch cycle:

  1. $M_\text{safe} = 0.9 M_\text{remain}$ is computed.
  2. Per-request memory cost $\text{Memory}_\text{KV}=2 L H D S_\text{max} B$ is calculated.
  3. $N_\text{max}$ is computed such that $N_\text{max}\cdot \text{Memory}_\text{KV} \leq M_\text{safe}$.
  4. The batch is filled with up to $N_\text{max}$ top-priority requests, sequences are padded to $S_\text{max}$, and the batch is submitted for prefill.
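
Steps 1 to 3 can be sketched as follows. The text does not define the symbols in the memory formula, so this sketch assumes $L$ is the number of layers, $H$ the number of attention heads, $D$ the per-head dimension, and $B$ the bytes per stored element. The function names and the 24 GiB figure are hypothetical. The model shape used is LLaMA-2 7B (32 layers, 32 heads, head dimension 128) in fp16.

```python
def kv_bytes_per_request(num_layers, num_heads, head_dim, s_max, bytes_per_elem):
    """Memory_KV = 2 * L * H * D * S_max * B (the factor 2 covers keys and values)."""
    return 2 * num_layers * num_heads * head_dim * s_max * bytes_per_elem


def max_batch_size(m_remain_bytes, per_request_bytes, safe_fraction=0.9):
    """Largest N with N * Memory_KV <= M_safe, where M_safe = safe_fraction * M_remain."""
    m_safe = safe_fraction * m_remain_bytes
    return int(m_safe // per_request_bytes)


# LLaMA-2 7B shape, fp16 cache, sequences padded to 1024 tokens.
per_request = kv_bytes_per_request(32, 32, 128, s_max=1024, bytes_per_elem=2)
n_max = max_batch_size(24 * 1024**3, per_request)  # assume 24 GiB of free GPU memory
```

Under these assumptions each request reserves 512 MiB of KV cache, so 43 requests fit within the 21.6 GiB safe budget.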

3. Adaptive Bucket Splitting and Merging

To address non-stationary request distributions and workload evolutions, BucketServe employs algorithmic splitting and merging of buckets.

  • Splitting occurs when a bucket contains significantly more requests below its midpoint than above, and its request count exceeds the minimum split size $m=N_\text{max}$. The split threshold parameter $\theta$ (default $0.5$) controls sensitivity: higher $\theta$ results in fewer splits and thus coarser buckets.
  • Merging: If the total number of requests is below $N_\text{max}$, all buckets are merged into $[0,L_\text{max})$.
  • Pseudocode is provided for this adaptive process, with $O(nk+k)$ complexity per bucket adjustment.

| Name | Operation Type | Parameters/Triggers |
|---|---|---|
| Bucket Splitting | Divide bucket | $\lvert b_\text{requests}\rvert > m$, $C_s/\lvert b\rvert > \theta$ |
| Bucket Merging | Merge buckets | $\lvert\text{total requests}\rvert < N_\text{max}$ |
| Boundary Selection | Bisection | Midpoint of $[L_b,U_b)$ |
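
The triggers above can be sketched in a few lines. This is not the paper's pseudocode, only an illustration under stated assumptions: $C_s$ is taken to be the number of requests below the bucket midpoint, buckets are represented as hypothetical `(lower, upper, lengths)` tuples sorted by range, and `adjust_buckets` is an invented name.

```python
def adjust_buckets(buckets, n_max, theta=0.5):
    """Return a new bucket list after one merge-or-split pass.

    buckets: list of (lower, upper, lengths) tuples sorted by range.
    """
    total = sum(len(lengths) for _, _, lengths in buckets)
    if total < n_max:
        # Merge: too few requests to fill one batch, so fall back to one bucket.
        merged = [s for _, _, lengths in buckets for s in lengths]
        return [(buckets[0][0], buckets[-1][1], merged)]

    result = []
    for lo, hi, lengths in buckets:
        mid = (lo + hi) // 2  # boundary selection by bisection
        short_reqs = [s for s in lengths if s < mid]  # C_s (assumed meaning)
        if len(lengths) > n_max and len(short_reqs) / len(lengths) > theta and lo < mid:
            long_reqs = [s for s in lengths if s >= mid]
            result += [(lo, mid, short_reqs), (mid, hi, long_reqs)]
        else:
            result.append((lo, hi, lengths))
    return result


skewed = [(0, 1024, [10] * 6 + [900] * 2)]
after_split = adjust_buckets(skewed, n_max=4)
```

Here six of eight requests fall below the midpoint 512, so the ratio 0.75 exceeds the default threshold and the bucket is divided into `[0, 512)` and `[512, 1024)`.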

4. Priority-Aware Scheduling and SLO Compliance

Within each bucket, request priorities $p_i$ are assigned as a weighted sum:

$$p_i = \alpha \cdot (\text{arrival\_time}_i) + \beta \cdot (\text{task\_priority}_i) + \gamma \cdot (\text{sequence\_length}_i)$$

where $\{\alpha, \beta, \gamma\}$ are tunable. The Dynamic Batching Controller admits requests with the highest $p_i$ into batches, balancing recency, task urgency, and job size.
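
A minimal sketch of priority-based admission follows. The text does not say how the arrival-time term is encoded or which signs the weights take, so this sketch assumes the arrival term is the time a request has already waited (older requests score higher) and uses a negative $\gamma$ to favor shorter sequences. The `Request` class, `select_batch`, and the default weights are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    wait_time: float      # seconds since arrival (assumed encoding of arrival_time)
    task_priority: float  # larger means more urgent
    seq_len: int


def priority(r, alpha=1.0, beta=1.0, gamma=-0.001):
    """p_i = alpha * arrival term + beta * task priority + gamma * sequence length."""
    return alpha * r.wait_time + beta * r.task_priority + gamma * r.seq_len


def select_batch(requests, n_max, **weights):
    """Admit the n_max requests with the highest priority score."""
    return heapq.nlargest(n_max, requests, key=lambda r: priority(r, **weights))


pending = [
    Request("a", wait_time=2.0, task_priority=0.0, seq_len=100),
    Request("b", wait_time=0.5, task_priority=3.0, seq_len=2000),
    Request("c", wait_time=0.1, task_priority=1.0, seq_len=50),
]
batch = select_batch(pending, n_max=2)
```

With the default weights the scores are 1.9, 1.5, and 1.05, so requests `a` and `b` are admitted and `c` waits for the next cycle.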

  • SLO attainment for a latency bound $L_\text{slo}$:

$$\text{Attainment} = \frac{1}{N_\text{total}} \sum_{i=1}^{N_\text{total}} \mathbb{I}\{\text{latency}_i \leq L_\text{slo}\}$$

  • Scheduler Objective:

$$\max~ \text{Throughput} - \lambda \cdot (1 - \text{Attainment})$$

with $\lambda$ controlling the tradeoff between throughput and SLO adherence.
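
Both quantities are straightforward to compute from logged latencies. The sketch below is illustrative only; the function names, latency samples, and the value of $\lambda$ are assumptions rather than values from the paper.

```python
def slo_attainment(latencies, l_slo):
    """Fraction of requests whose latency is at most the SLO bound."""
    return sum(lat <= l_slo for lat in latencies) / len(latencies)


def scheduler_objective(throughput, attainment, lam):
    """Throughput penalized by lam times the fraction of SLO violations."""
    return throughput - lam * (1.0 - attainment)


latencies = [0.1, 0.2, 0.3, 0.6]  # seconds, hypothetical measurements
attainment = slo_attainment(latencies, l_slo=0.5)
score = scheduler_objective(throughput=100.0, attainment=attainment, lam=40.0)
```

Three of the four sample requests meet the 0.5 s bound, giving an attainment of 0.75 and an objective value of 90.0. Because throughput and attainment are on different scales, $\lambda$ must be chosen in throughput units for the penalty to matter.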

5. Empirical Evaluation and Performance Metrics

The framework is evaluated on a testbed comprising 4×NVIDIA A100 GPUs (40 GB, NVLink), a 64-core CPU, and 1 TB NVMe SSD, using LLaMA-2 (7B, 13B) and OPT (6.7B) models. Workloads span Stanford Alpaca (short), LongBench (long), and mixed datasets.

Key metrics and results:

| Metric | UELLM | DistServe | BucketServe |
|---|---|---|---|
| Throughput (tokens/s, Mixed, 13B) | ~8k | ~15k | ~54k |
| GPU Utilization (%) | 42 | 55 | 81.66 |
| SLO Attainment (Alpaca, SLO = 200 ms) | - | 60 RPS | 82 RPS |
| SLO Attainment (Mixed, SLO = 500 ms) | - | 45 RPS | 87 RPS |
| Bucketing Overhead (%) | <1 | - | <1 |

Additional findings:

  • Server RPS vs. Client RPS: BucketServe server RPS closely matches incoming request rates up to 190 RPS; DistServe plateaus near 100 RPS; UELLM saturates at ~55 RPS.
  • End-to-End Latency: Decoding accounts for ~90% of latency; bucketing overhead is <1%, remaining constant even as the number of buckets increases from 1 to 16.

6. Practical Considerations, Limitations, and Tuning

BucketServe is optimized for highly heterogeneous workloads and high concurrency scenarios where static or naive continuous batching incurs significant inefficiencies. Its main limitations and tuning insights include:

  • Low RPS Regimes: If the number of pending requests falls below $N_\text{max}$, all buckets are merged, reducing the benefit of fine-grained bucketing.
  • Highly Skewed Workloads: Extreme length distributions may trigger frequent splits, resulting in marginally increased overhead.
  • Architecture: Implementation and empirical validation are limited to single-node deployment; multi-node or cluster-wide coordination is not yet available.
  • Tuning Parameters:
    • Split threshold $\theta$: Higher values (e.g., 0.7) reduce splits and overhead but increase padding; lower values enable finer bucketing with a slight overhead increase.
    • Safe-memory fraction (default 0.9): Can be reduced (e.g., to 0.85) for more conservative batching and lower OOM risk; raising it admits larger batches at elevated OOM risk.
    • Priority weights $\{\alpha, \beta, \gamma\}$: Tuned to emphasize arrival time, task urgency, or sequence-length bias in scheduling.

7. Future Directions

Development plans entail extending adaptive bucketing and scheduling mechanisms to multi-node serving clusters, integrating load-aware rebalancing strategies, and investigating reinforcement-learning–based scheduling policies for further gains in throughput and SLO compliance (Zheng et al., 23 Jul 2025).
