BucketServe: Dynamic Batching for LLM Inference
- BucketServe is a bucket-based dynamic batching framework that adaptively groups LLM inference requests by sequence length to reduce padding waste and meet latency SLOs.
- It features a multi-component architecture—including a Request Bucketing Manager, Dynamic Batching Controller, and priority-aware scheduler—to optimize GPU memory usage.
- Empirical evaluations demonstrate that BucketServe significantly increases throughput and GPU utilization while maintaining SLO compliance under diverse, heterogeneous workloads.
BucketServe is a bucket-based dynamic batching framework engineered to optimize inference performance for LLMs under heterogeneous workloads. Unlike traditional LLM serving systems that rely on static or continuous batching—often resulting in inefficient GPU memory utilization and increased latency—BucketServe adaptively groups and schedules requests by sequence length, dynamically adjusts batch sizes to hardware constraints, and integrates priority-aware scheduling to satisfy service level objectives (SLOs) (Zheng et al., 23 Jul 2025). Its design addresses the fundamental tension between maximizing throughput and maintaining strict latency requirements in real-time LLM applications.
1. System Architecture and Component Workflow
BucketServe comprises five primary components: Gateway, Request Bucketing Manager, Dynamic Batching Controller, P/D Scheduler, and Global Monitor. The typical request processing pipeline involves the following stages:
- Gateway: Receives user requests and annotates them with metadata including sequence length, task type, and priority.
- Request Bucketing Manager: Maintains a set of buckets, each associated with a sequence-length interval. An incoming request is assigned to the unique bucket whose interval contains its sequence length. Buckets are dynamically split or merged as the workload fluctuates.
- Dynamic Batching Controller: Periodically (or when a queue reaches a threshold), for each bucket, it computes the safe GPU memory budget, a safe fraction (0.9 by default) of the measured GPU memory. It then determines the maximum batch size, the largest number of requests whose combined memory cost fits within that budget, and selects up to that many requests for batching and padded submission.
- P/D Scheduler: Handles prefill (building key-value (KV) caches on a first-come-first-served (FCFS) basis), orchestrates KV-cache transfer via NVLink, and manages decoding (using continuous batching per Orca-style strategies).
- Global Monitor: Tracks GPU and system metrics, feeding back into the Bucketing Manager and Batching Controller for online adjustment.
Pipeline flow, as per the architecture, is:
```text
User → Gateway → Bucketing Manager → Buckets b₁,…,b_K
                                        │
                                        └─> Dynamic Batching Controller ──> Prefill Queue ──> Prefill Workers
                                                                                                   ↓ (NVLink)
                                                             Decoding Queue ──> Decoding Workers → User
```
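The bucket lookup performed by the Request Bucketing Manager can be illustrated with a small sketch. The half-open interval convention, the `boundaries` list, and the function name `assign_bucket` are assumptions made for illustration; the source does not specify them.

```python
from bisect import bisect_right


def assign_bucket(boundaries, seq_len):
    """Return the index of the bucket whose interval contains seq_len.

    Assumption: boundaries is a sorted list [b0, b1, ..., bK] and bucket i
    covers the half-open interval [b_i, b_{i+1}).
    """
    if not boundaries[0] <= seq_len < boundaries[-1]:
        raise ValueError(f"sequence length {seq_len} is outside the bucketed range")
    # bisect_right counts the edges that are <= seq_len; subtract one for the index.
    return bisect_right(boundaries, seq_len) - 1


boundaries = [0, 128, 512, 2048, 4096]  # hypothetical bucket edges, in tokens
buckets = {i: [] for i in range(len(boundaries) - 1)}
for request_id, seq_len in [("a", 37), ("b", 128), ("c", 1900), ("d", 300)]:
    buckets[assign_bucket(boundaries, seq_len)].append(request_id)
```

Because the edges are kept sorted, each lookup is a binary search, so assignment cost grows only logarithmically with the number of buckets.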
2. Bucket Formation, Waste Minimization, and Dynamic Batching
Bucket formation is realized by partitioning the incoming request stream into sequence-length intervals. Each bucket contains requests of similar length, which minimizes input sequence padding and the associated computational waste.
- Padding overhead for a batch of requests is quantified from the gap between each request's length and the longest sequence length in the batch, to which all sequences are padded.
- Expected waste is the aggregate padding overhead across all buckets, weighted by the probability density function (PDF) of incoming sequence lengths.
- The optimal bucket boundary is the one that minimizes expected waste. In practice, bucket boundaries are approximated via midpoint bisection.
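To make the padding argument concrete, the following sketch compares padding for one mixed batch against the same requests grouped by length. The helper names, the example lengths, and the normalization used in `padding_ratio` are assumptions for illustration, not the paper's exact definitions.

```python
def padding_tokens(lengths):
    """Padding tokens added when every sequence is padded to the batch maximum."""
    if not lengths:
        return 0
    longest = max(lengths)
    return sum(longest - n for n in lengths)


def padding_ratio(lengths):
    """Share of the padded batch occupied by padding (an assumed normalization)."""
    if not lengths:
        return 0.0
    return padding_tokens(lengths) / (len(lengths) * max(lengths))


lengths = [20, 30, 40, 900, 1000]  # hypothetical mixed workload, in tokens
single_batch = padding_tokens(lengths)
bucketed = padding_tokens([20, 30, 40]) + padding_tokens([900, 1000])
print(single_batch, bucketed)  # 3010 130
```

In this toy case, separating short and long requests cuts padding from 3010 tokens to 130, which is the effect bucketing is meant to exploit.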
Dynamic Batching leverages real-time GPU memory measurements. On each batch cycle:
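- The safe GPU memory budget is computed.
- Per-request memory cost is calculated.
- The maximum batch size is computed such that the total memory cost of the batch stays within the safe budget.
- The batch is filled with up to that many top-priority requests, sequences are padded to the longest length in the batch, and the batch is submitted for prefill.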
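The batch cycle above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Request` dataclass, the linear `bytes_per_token` cost model, costing each request at the bucket's upper bound, and applying the safe fraction to free memory are all hypothetical choices, not details confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    seq_len: int      # tokens
    priority: float   # higher is served first


def max_batch_size(free_memory_bytes, bytes_per_request, safe_fraction=0.9):
    """Largest request count whose total memory cost fits in the safe budget."""
    safe_budget = safe_fraction * free_memory_bytes
    return int(safe_budget // bytes_per_request)


def form_batch(bucket, free_memory_bytes, bytes_per_token, bucket_upper,
               safe_fraction=0.9):
    """Pick the top-priority requests that fit and report the padded length."""
    # Assumption: every request is costed at the bucket's upper bound, so padding
    # to the longest selected sequence cannot exceed the budget.
    bytes_per_request = bytes_per_token * bucket_upper
    limit = max_batch_size(free_memory_bytes, bytes_per_request, safe_fraction)
    chosen = sorted(bucket, key=lambda r: r.priority, reverse=True)[:limit]
    padded_len = max((r.seq_len for r in chosen), default=0)
    return chosen, padded_len


bucket = [
    Request("a", 200, 0.4),
    Request("b", 120, 0.9),
    Request("c", 250, 0.1),
    Request("d", 180, 0.7),
]
chosen, padded_len = form_batch(bucket, free_memory_bytes=10_000,
                                bytes_per_token=10, bucket_upper=250)
```

With these toy numbers the safe budget is 9,000 bytes and each request is costed at 2,500 bytes, so three requests are admitted in priority order and padded to 200 tokens.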
3. Adaptive Bucket Splitting and Merging
To address non-stationary request distributions and workload evolutions, BucketServe employs algorithmic splitting and merging of buckets.
- Splitting occurs when a bucket contains significantly more requests below its midpoint than above, and its interval length exceeds the minimum split size. The split threshold parameter (default 0.5) controls sensitivity: a higher threshold results in fewer splits and thus coarser buckets.
- Merging: If the total number of requests is below a minimum-request threshold, all buckets are merged into a single bucket covering the full length range.
- The source gives pseudocode for this adaptive process, together with its complexity per bucket adjustment.
| Name | Operation Type | Parameters/Triggers |
|---|---|---|
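| Bucket Splitting | Divide bucket | Skew toward lower half exceeds split threshold; interval length exceeds minimum split size |
| Bucket Merging | Merge buckets | Total requests below minimum-request threshold |
| Boundary Selection | Bisection | Midpoint of the bucket interval |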
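The split and merge rules can be sketched as below. The source does not state the exact skew test here, so this sketch assumes skew is measured as (below − above) / total and compared against the split threshold; `min_split_size`, `min_requests`, and both function names are hypothetical.

```python
def maybe_split(low, high, lengths, split_threshold=0.5, min_split_size=64):
    """Split [low, high) at its midpoint when requests skew toward the lower half.

    Assumption: skew = (below - above) / total, compared to split_threshold.
    """
    if high - low <= min_split_size or not lengths:
        return [(low, high)]
    mid = (low + high) // 2
    below = sum(1 for n in lengths if n < mid)
    above = len(lengths) - below
    if (below - above) / len(lengths) > split_threshold:
        return [(low, mid), (mid, high)]
    return [(low, high)]


def maybe_merge(buckets, total_requests, min_requests=8):
    """Collapse all buckets into one covering the full range when load is low."""
    if total_requests < min_requests and len(buckets) > 1:
        return [(buckets[0][0], buckets[-1][1])]
    return buckets


lengths = [10, 20, 30, 40, 50, 60, 70, 900]
print(maybe_split(0, 1024, lengths))                       # [(0, 512), (512, 1024)]
print(maybe_split(0, 1024, lengths, split_threshold=0.8))  # [(0, 1024)]
print(maybe_merge([(0, 512), (512, 1024)], total_requests=3))  # [(0, 1024)]
```

Under this reading, raising the threshold from 0.5 to 0.8 suppresses the split for the same data, matching the stated behavior that higher thresholds give coarser buckets.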
4. Priority-Aware Scheduling and SLO Compliance
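Within each bucket, request priorities are assigned as a weighted sum of terms for arrival time, task urgency, and sequence length, where the weights are tunable. The Dynamic Batching Controller admits requests with the highest priority into batches, balancing recency, task urgency, and job size.
- SLO attainment for a latency bound is the fraction of requests whose latency stays within that bound.
- Scheduler objective: throughput and SLO attainment are combined into a single objective, with a weighting parameter controlling the tradeoff between throughput and SLO adherence.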
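A minimal sketch of priority scoring and SLO attainment follows. The specific terms (waiting time, an urgency score, and a bias toward shorter sequences), the weight names, and their default values are assumptions for illustration; the source only states that the priority is a tunable weighted sum.

```python
def priority(wait_s, urgency, seq_len, max_len,
             w_wait=1.0, w_urgency=1.0, w_length=1.0):
    """Weighted-sum priority; higher scores are admitted first.

    Assumed terms: time spent waiting, a task-urgency score, and a bias
    toward shorter sequences (the direction of the bias is an assumption).
    """
    shortness = 1.0 - seq_len / max_len
    return w_wait * wait_s + w_urgency * urgency + w_length * shortness


def slo_attainment(latencies_ms, bound_ms):
    """Fraction of requests whose latency is within the bound."""
    if not latencies_ms:
        return 0.0
    return sum(1 for t in latencies_ms if t <= bound_ms) / len(latencies_ms)


print(priority(wait_s=2.0, urgency=1.0, seq_len=100, max_len=400))  # 3.75
print(slo_attainment([120, 180, 250, 90], bound_ms=200))            # 0.75
```

Raising one weight relative to the others shifts admission toward that criterion, for example a larger `w_wait` favors requests that have waited longest.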
5. Empirical Evaluation and Performance Metrics
The framework is evaluated on a testbed comprising 4×NVIDIA A100 GPUs (40 GB, NVLink), a 64-core CPU, and 1 TB NVMe SSD, using LLaMA-2 (7B, 13B) and OPT (6.7B) models. Workloads span Stanford Alpaca (short), LongBench (long), and mixed datasets.
Key metrics and results:
| Metric | UELLM | DistServe | BucketServe |
|---|---|---|---|
| Throughput (tokens/s, Mixed, 13B) | ~8k | ~15k | ~54k |
| GPU Utilization (%) | 42 | 55 | 81.66 |
| SLO Attainment (Alpaca, SLO=200ms) | - | 60 RPS | 82 RPS |
| SLO Attainment (Mixed, SLO=500ms) | - | 45 RPS | 87 RPS |
| Bucketing Overhead (%) | <1 | - | <1 |
Additional findings:
- Server RPS vs. Client RPS: BucketServe server RPS closely matches incoming request rates up to 190 RPS; DistServe plateaus near 100 RPS; UELLM saturates at ~55 RPS.
- End-to-End Latency: Decoding accounts for ~90% of latency; bucketing overhead is <1%, remaining constant even as the number of buckets increases from 1 to 16.
6. Practical Considerations, Limitations, and Tuning
BucketServe is optimized for highly heterogeneous workloads and high concurrency scenarios where static or naive continuous batching incurs significant inefficiencies. Its main limitations and tuning insights include:
- Low RPS Regimes: If request rates fall below the merge threshold, buckets are merged, reducing the benefit of fine-grained bucketing.
- Highly Skewed Workloads: Extreme length distributions may trigger frequent splits, resulting in marginally increased overhead.
- Architecture: Implementation and empirical validation are limited to single-node deployment; multi-node or cluster-wide coordination is not yet available.
- Tuning Parameters:
  - Split threshold: Higher values (e.g., 0.7) reduce splits and overhead but increase padding; lower values enable finer bucketing with a slight overhead increase.
  - Safe-memory fraction (default 0.9): Can be reduced (e.g., to 0.85) for more aggressive batching at elevated OOM risk.
  - Priority weights: Tuned to emphasize arrival time, task urgency, or sequence-length bias in scheduling.
7. Future Directions
Development plans entail extending adaptive bucketing and scheduling mechanisms to multi-node serving clusters, integrating load-aware rebalancing strategies, and investigating reinforcement-learning–based scheduling policies for further gains in throughput and SLO compliance (Zheng et al., 23 Jul 2025).