BucketServe: Dynamic Batching for LLM Inference

Updated 21 March 2026

BucketServe is a bucket-based dynamic batching framework that adaptively groups LLM inference requests by sequence length to reduce padding waste and meet latency SLOs.
It features a multi-component architecture—including a Request Bucketing Manager, Dynamic Batching Controller, and priority-aware scheduler—to optimize GPU memory usage.
Empirical evaluations report higher throughput and GPU utilization than the UELLM and DistServe baselines (about 54k tokens/s versus about 15k for DistServe on the mixed 13B workload, with 81.66% GPU utilization) while maintaining SLO compliance under heterogeneous workloads.

BucketServe is a bucket-based dynamic batching framework engineered to optimize inference performance for LLMs under heterogeneous workloads. Unlike traditional LLM serving systems that rely on static or continuous batching—often resulting in inefficient GPU memory utilization and increased latency—BucketServe adaptively groups and schedules requests by sequence length, dynamically adjusts batch sizes to hardware constraints, and integrates priority-aware scheduling to satisfy service level objectives (SLOs) (Zheng et al., 23 Jul 2025). Its design addresses the fundamental tension between maximizing throughput and maintaining strict latency requirements in real-time LLM applications.

1. System Architecture and Component Workflow

BucketServe comprises five primary components: Gateway, Request Bucketing Manager, Dynamic Batching Controller, P/D Scheduler, and Global Monitor. The typical request processing pipeline involves the following stages:

Gateway: Receives user requests and annotates them with metadata including sequence length, task type, and priority.
Request Bucketing Manager: Maintains a set $\mathcal{B} = \{b_1,\ldots,b_K\}$ of buckets, each associated with an interval $[L_b,U_b)$. An incoming request is assigned to the unique bucket whose interval contains its sequence length. Buckets are dynamically split or merged as the workload fluctuates.
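Dynamic Batching Controller: Periodically, or when a queue reaches a threshold, it computes for each bucket $b$ the safe GPU memory $M_\text{safe}=0.9\,M_\text{remain}$. It then determines $N_\text{max} = \max\{N \mid \sum_{i=1}^N \text{Memory}_\text{KV}(i)\leq M_\text{safe}\}$, where $\text{Memory}_\text{KV}(i) = 2\cdot L\cdot H\cdot D\cdot S_\text{max}\cdot B$ is the KV-cache footprint of request $i$ once padded to the batch maximum $S_\text{max}$. The symbols are not defined here; in the usual KV-cache size formula the factor 2 covers keys and values, and $L$, $H$, $D$, and $B$ denote the layer count, number of attention heads, head dimension, and bytes per element. The controller then selects up to $N_\text{max}$ requests, pads them, and submits them as one batch.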
P/D Scheduler: Handles prefill (building key-value (KV) caches on a first-come-first-served (FCFS) basis), orchestrates KV-cache transfer via NVLink, and manages decoding (using continuous batching per Orca-style strategies).
Global Monitor: Tracks GPU and system metrics, feeding back into the Bucketing Manager and Batching Controller for online adjustment.

Pipeline flow, as per the architecture, is:

User → Gateway → Bucketing Manager → Buckets b₁,…,b_K
  │
  └─> Dynamic Batching Controller ──> Prefill Queue ──> Prefill Workers
                                                              ↓ (NVLink)
                                   Decoding Queue ─ Decoding Workers → User

2. Bucket Formation, Waste Minimization, and Dynamic Batching

Bucket formation partitions the incoming request stream by sequence length into $K$ intervals $[L_b,U_b)$. Each bucket contains requests of similar length, which minimizes input sequence padding and the associated computational waste.
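The assignment rule can be sketched in a few lines of Python. This is an illustrative sketch only, not BucketServe's implementation: the function name `assign_bucket` and the example boundaries are hypothetical, and the buckets are assumed to be contiguous so that a sorted list of edges describes all $K$ intervals.

```python
from bisect import bisect_right

def assign_bucket(boundaries, seq_len):
    """Return index b such that boundaries[b] <= seq_len < boundaries[b + 1]."""
    if not boundaries[0] <= seq_len < boundaries[-1]:
        raise ValueError(f"sequence length {seq_len} is outside all buckets")
    # bisect_right finds the first edge strictly greater than seq_len,
    # which is the upper edge U_b of the containing bucket.
    return bisect_right(boundaries, seq_len) - 1

# Hypothetical edges giving K = 3 buckets: [0,512), [512,1024), [1024,4096)
boundaries = [0, 512, 1024, 4096]
print([assign_bucket(boundaries, s) for s in (37, 512, 3000)])  # [0, 1, 2]
```

Because the intervals are half-open, a request whose length equals an edge lands in the bucket that starts at that edge.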

Padding Overhead for a batch of $N$ requests with lengths $\{S_i\}$ is quantified as:

$\text{Waste}_\text{Ratio} = \frac{S_\text{max}-S_\text{avg}}{S_\text{max}} \quad \text{(Eq.~2)}$

where $S_\text{max} = \max_i S_i$ and $S_\text{avg} = (1/N)\sum_i S_i$.
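A small worked example of Eq. 2 shows why grouping by length helps. The batch contents below are made-up numbers chosen for illustration, and `waste_ratio` is a hypothetical helper name.

```python
def waste_ratio(lengths):
    """Eq. 2: fraction of padded positions in a batch padded to its longest sequence."""
    s_max = max(lengths)
    s_avg = sum(lengths) / len(lengths)
    return (s_max - s_avg) / s_max

mixed = [100, 200, 300, 1000]   # one long request forces heavy padding
bucketed = [100, 200, 300]      # same short requests without the outlier

print(round(waste_ratio(mixed), 3))     # 0.6
print(round(waste_ratio(bucketed), 3))  # 0.333
```

Moving the single 1000-token request into its own bucket drops the padded fraction of the short batch from 60% to about 33%.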

Expected Waste is the aggregate padding overhead across all buckets:

$\mathbb{E}[\text{Waste}] = \sum_{b=1}^K \int_{L_b}^{U_b} \left(1-\frac{S}{U_b}\right) f(S)\, dS \quad \text{(Eq.~3)}$

where $f(S)$ is the PDF of incoming sequence lengths.
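When $f(S)$ is not known in closed form, Eq. 3 can be estimated from observed sequence lengths by replacing the integral with a sample mean. The sketch below assumes contiguous buckets described by a sorted list of edges; the sample values and the name `expected_waste` are hypothetical.

```python
from bisect import bisect_right

def expected_waste(samples, boundaries):
    """Empirical estimate of Eq. 3: mean of (1 - S / U_b) over observed lengths S."""
    total = 0.0
    for s in samples:
        # First edge strictly greater than s is the upper bound U_b of its bucket.
        upper = boundaries[bisect_right(boundaries, s)]
        total += 1.0 - s / upper
    return total / len(samples)

samples = [100, 300, 600, 900]          # made-up observed sequence lengths
print(round(expected_waste(samples, [0, 1000]), 3))       # 0.525 with one bucket
print(round(expected_waste(samples, [0, 500, 1000]), 3))  # 0.425 with two buckets
```

Adding a boundary at 500 lowers the estimate because the two short requests are now measured against an upper bound of 500 rather than 1000.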

Optimal Bucket Boundary to minimize expected waste is specified as:

$U_b^* = \frac{\int_{L_b}^{U_b} S f(S) dS}{\int_{L_b}^{U_b} f(S) dS} \quad \text{(Eq.~4)}$

The right-hand side of Eq. 4 is the mean sequence length of requests falling in $[L_b,U_b)$. In practice, bucket boundaries are approximated by midpoint bisection.

Dynamic Batching leverages real-time GPU memory measurements. On each batch cycle:

$M_\text{safe} = 0.9 M_\text{remain}$ is computed.
Per-request memory cost $\text{Memory}_\text{KV}=2 L H D S_\text{max} B$ is calculated.
$N_\text{max}$ is set to the largest integer such that $N_\text{max}\cdot \text{Memory}_\text{KV} \leq M_\text{safe}$.
The batch is filled with up to $N_\text{max}$ top-priority requests, sequences are padded to $S_\text{max}$, and the batch is submitted for prefill.
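The memory arithmetic in these steps can be sketched as follows. The parameter names are assumptions: the text does not define $L$, $H$, $D$, or $B$, so the code reads them as layer count, attention heads, head dimension, and bytes per element, as in the usual KV-cache size formula. The model shape (32 layers, 32 heads, head dimension 128, 2-byte elements) and the 21 GiB of free memory are illustrative values, not figures from the evaluation.

```python
def kv_bytes_per_request(num_layers, num_heads, head_dim, s_max, bytes_per_elem):
    """Memory_KV = 2 * L * H * D * S_max * B (the factor 2 covers keys and values)."""
    return 2 * num_layers * num_heads * head_dim * s_max * bytes_per_elem

def max_batch_size(m_remain, per_request_bytes, safe_fraction=0.9):
    """Largest N with N * Memory_KV <= M_safe, where M_safe = safe_fraction * M_remain."""
    m_safe = safe_fraction * m_remain
    return int(m_safe // per_request_bytes)

# Assumed shape: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes), padded to 1024 tokens.
per_request = kv_bytes_per_request(32, 32, 128, 1024, 2)
print(per_request // 2**20)                        # 512 (MiB per padded request)
print(max_batch_size(21 * 2**30, per_request))     # 37 requests fit in 0.9 * 21 GiB
```

Since every request in the batch is padded to $S_\text{max}$, the per-request cost is identical and the sum in the definition of $N_\text{max}$ reduces to a single division.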

3. Adaptive Bucket Splitting and Merging

To address non-stationary request distributions and workload evolutions, BucketServe employs algorithmic splitting and merging of buckets.

Splitting: A bucket is split at its midpoint when the fraction of its requests that fall below the midpoint, $C_s/\lvert b\rvert$ (taking $C_s$ to be the count of requests below the midpoint), exceeds the split threshold $\theta$ (default $0.5$) and its request count exceeds the minimum split size $m=N_\text{max}$. A higher $\theta$ yields fewer splits and thus coarser buckets.
Merging: If the total number of requests is below $N_\text{max}$, all buckets are merged into $[0,L_\text{max})$.
Pseudocode is provided for this adaptive process, with $O(nk+k)$ complexity per bucket adjustment.

Name	Operation Type	Parameters/Triggers
Bucket Splitting	Divide bucket	$\lvert b_\text{requests}\rvert>m$, $C_s/\lvert b\rvert>\theta$
Bucket Merging	Merge buckets	$\lvert\text{total requests}\rvert<N_\text{max}$
Boundary Selection	Bisection	Midpoint of $[L_b,U_b)$
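The split and merge triggers can be sketched as a single adjustment pass. This is a minimal reading of the rules above, not the paper's pseudocode: the dictionary layout, the function name `adjust_buckets`, and the interpretation of $C_s$ as the number of requests below the midpoint are all assumptions.

```python
def adjust_buckets(buckets, n_max, l_max, theta=0.5):
    """One adjustment pass. buckets maps (lower, upper) -> list of request lengths."""
    total = sum(len(reqs) for reqs in buckets.values())
    if total < n_max:
        # Merge rule: too few requests overall, collapse everything into [0, l_max).
        merged = [s for reqs in buckets.values() for s in reqs]
        return {(0, l_max): merged}

    result = {}
    for (lo, hi), reqs in buckets.items():
        mid = (lo + hi) // 2                     # boundary chosen by bisection
        below = [s for s in reqs if s < mid]     # C_s: requests under the midpoint
        # Split rule: more than m = n_max requests and C_s / |b| > theta.
        if len(reqs) > n_max and lo < mid and len(below) / len(reqs) > theta:
            result[(lo, mid)] = below
            result[(mid, hi)] = [s for s in reqs if s >= mid]
        else:
            result[(lo, hi)] = reqs
    return result

buckets = {(0, 1024): [50, 60, 70, 80, 900], (1024, 2048): [1500]}
print(sorted(adjust_buckets(buckets, n_max=4, l_max=2048)))
# [(0, 512), (512, 1024), (1024, 2048)]
print(sorted(adjust_buckets(buckets, n_max=10, l_max=2048)))
# [(0, 2048)]
```

With `n_max=4` the first bucket holds five requests, four of them below 512, so it splits; with `n_max=10` the six queued requests fall under the merge threshold and a single bucket remains.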

4. Priority-Aware Scheduling and SLO Compliance

Within each bucket, request priorities $p_i$ are assigned as a weighted sum:

$p_i = \alpha \cdot (\text{arrival\_time}_i) + \beta \cdot (\text{task\_priority}_i) + \gamma \cdot (\text{sequence\_length}_i)$

where $\{\alpha, \beta, \gamma\}$ are tunable. The Dynamic Batching Controller admits requests with the highest $p_i$ into batches, balancing recency, task urgency, and job size.
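A sketch of the weighted-sum priority and top-$N$ admission follows. The request fields, function names, and weight values are hypothetical. In particular, the signs of the weights are a design choice the text leaves open: the example uses a negative $\alpha$ so that earlier arrivals score higher and a negative $\gamma$ so that shorter jobs are favored.

```python
import heapq

def priority(req, alpha, beta, gamma):
    """p_i = alpha * arrival_time + beta * task_priority + gamma * sequence_length."""
    return (alpha * req["arrival_time"]
            + beta * req["task_priority"]
            + gamma * req["sequence_length"])

def select_top(requests, n_max, alpha, beta, gamma):
    """Admit the n_max requests with the highest priority score, best first."""
    return heapq.nlargest(
        n_max, requests, key=lambda r: priority(r, alpha, beta, gamma)
    )

queue = [
    {"id": "a", "arrival_time": 0.0, "task_priority": 1, "sequence_length": 100},
    {"id": "b", "arrival_time": 1.0, "task_priority": 3, "sequence_length": 500},
    {"id": "c", "arrival_time": 2.0, "task_priority": 1, "sequence_length": 50},
]
chosen = select_top(queue, n_max=2, alpha=-1.0, beta=10.0, gamma=-0.01)
print([r["id"] for r in chosen])  # ['b', 'a']
```

Here request `b` scores 24.0 on task urgency alone, while `a` (9.0) beats `c` (7.5) because it arrived earlier.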

SLO attainment for a latency bound $L_\text{slo}$ is the fraction of requests served within the bound:

$\text{Attainment} = \frac{1}{N_\text{total}} \sum_{i=1}^{N_\text{total}} \mathbb{I}\{\text{latency}_i \leq L_\text{slo}\}$

Scheduler Objective:

$\max~ \text{Throughput} - \lambda \cdot (1 - \text{Attainment})$

with $\lambda$ controlling the tradeoff between throughput and SLO adherence.
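Both quantities are straightforward to compute from logged latencies. The sketch below uses made-up latencies, throughput, and $\lambda$ purely to show the arithmetic; the function names are hypothetical.

```python
def slo_attainment(latencies, l_slo):
    """Fraction of requests whose latency is at or below the SLO bound."""
    return sum(lat <= l_slo for lat in latencies) / len(latencies)

def scheduler_objective(throughput, attainment, lam):
    """Throughput - lambda * (1 - Attainment); larger is better."""
    return throughput - lam * (1.0 - attainment)

latencies_ms = [120, 180, 250, 90]             # made-up per-request latencies
att = slo_attainment(latencies_ms, l_slo=200)
print(att)                                      # 0.75
print(scheduler_objective(1000.0, att, 400.0))  # 900.0
```

A larger $\lambda$ makes each missed SLO cost more throughput-equivalent units, so the scheduler is pushed toward configurations that trade batch size for latency.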

5. Empirical Evaluation and Performance Metrics

The framework is evaluated on a testbed comprising 4×NVIDIA A100 GPUs (40 GB, NVLink), a 64-core CPU, and 1 TB NVMe SSD, using LLaMA-2 (7B, 13B) and OPT (6.7B) models. Workloads span Stanford Alpaca (short), LongBench (long), and mixed datasets.

Key metrics and results:

Metric	UELLM	DistServe	BucketServe
Throughput (tokens/s, Mixed, 13B)	~8k	~15k	~54k
GPU Utilization (%)	42	55	81.66
SLO Attainment (Alpaca, SLO=200ms)	-	60 RPS	82 RPS
SLO Attainment (Mixed, SLO=500ms)	-	45 RPS	87 RPS
Bucketing Overhead (%)	<1	-	<1

Additional findings:

Server RPS vs. Client RPS: BucketServe server RPS closely matches incoming request rates up to 190 RPS; DistServe plateaus near 100 RPS; UELLM saturates at ~55 RPS.
End-to-End Latency: Decoding accounts for ~90% of latency; bucketing overhead is <1%, remaining constant even as the number of buckets increases from 1 to 16.

6. Practical Considerations, Limitations, and Tuning

BucketServe is optimized for highly heterogeneous workloads and high concurrency scenarios where static or naive continuous batching incurs significant inefficiencies. Its main limitations and tuning insights include:

Low RPS Regimes: When fewer than $N_\text{max}$ requests are queued, all buckets are merged, which removes the benefit of fine-grained bucketing.
Highly Skewed Workloads: Extreme length distributions may trigger frequent splits, resulting in marginally increased overhead.
Architecture: Implementation and empirical validation are limited to single-node deployment; multi-node or cluster-wide coordination is not yet available.
Tuning Parameters:
- Split threshold $\theta$: Higher values (e.g., 0.7) reduce splits and overhead but increase padding; lower values enable finer bucketing with a slight overhead increase.
- Safe-memory fraction (default 0.9): Since $M_\text{safe}$ scales with this fraction, lowering it (e.g., to 0.85) leaves more headroom and reduces OOM risk at the cost of smaller batches, while raising it batches more aggressively at elevated OOM risk.
- Priority weights $\{\alpha, \beta, \gamma\}$: Tuned to emphasize arrival time, task urgency, or sequence-length bias in scheduling.

7. Future Directions

Development plans entail extending adaptive bucketing and scheduling mechanisms to multi-node serving clusters, integrating load-aware rebalancing strategies, and investigating reinforcement-learning–based scheduling policies for further gains in throughput and SLO compliance (Zheng et al., 23 Jul 2025).

References (1)

Zheng et al. BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving (2025)
