Binocular Batching for LLM Inference
- Binocular batching is a technique that partitions LLM inference requests into two bins—short and long—based on predicted execution times to reduce overall service time.
- By grouping requests with similar expected processing times, the method minimizes waiting delays and improves throughput by up to 31% in uniform scenarios.
- The approach employs queueing-theoretic models and lightweight predictors, ensuring robust performance even with moderate prediction errors.
Binocular batching is a two-bin instantiation of the Multi-Bin Batching framework, designed to increase throughput during inference of LLMs by leveraging queueing-theoretic control policies that group requests with similar execution lengths. This method addresses the inefficiency arising from heterogeneous request runtimes in traditional batching, which causes hardware to be underutilized while waiting for the longest request to complete. By partitioning incoming requests—via prediction—into “short” and “long” bins at the median of the execution time distribution, binocular batching forms uniform-length batches and provably improves both resource utilization and LLM inference throughput (Guldogan et al., 2024).
1. Queueing-Theoretic Model and Problem Formulation
Binocular batching operates within a queueing-theoretic framework. The system is modeled as follows:
- Requests arrive according to a Poisson process with rate $\lambda$.
- The server maintains a single, infinite-buffer queue and processes requests in fixed-size batches of $B$.
- Each incoming request’s service (generation) time is independent and identically distributed, uniform on $[0, l_{\max}]$.
- The service time of a batch, $S$, is defined as $S = \max_{i=1,\dots,B} T_i$, i.e., the maximum of its requests’ service times $T_1,\dots,T_B$. This setup reflects typical GPU-backed LLM inference workloads, where maximizing throughput is constrained by the slowest request in each batch.
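As a concrete check on this model, the following sketch (illustrative, not from the paper) estimates the expected batch service time by Monte Carlo and compares it against the closed-form expected maximum of $B$ i.i.d. uniform draws:

```python
import random

def batch_service_time(request_times):
    """A batch finishes only when its slowest request does."""
    return max(request_times)

# Monte Carlo estimate of E[batch service time] for B i.i.d.
# Uniform(0, l_max) request times, versus the closed form B/(B+1) * l_max.
random.seed(0)
B, l_max, trials = 8, 1.0, 100_000
est = sum(
    batch_service_time([random.uniform(0, l_max) for _ in range(B)])
    for _ in range(trials)
) / trials
closed_form = B / (B + 1) * l_max
print(round(est, 2), round(closed_form, 2))  # both ≈ 0.89
```

For $B = 8$ the batch already runs at nearly the full $l_{\max}$ on average, which is exactly the underutilization that binning attacks.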
2. Two-Bin Partitioning and Threshold Assignment
The engine of binocular batching is the partition of the request-length domain into two intervals of equal probability mass under the uniform distribution. The unique optimal placement of the separation threshold, which maximizes throughput under the uniform law, is the median: $l_1 = l_{\max}/2$.
Bin 1 comprises requests with $l < l_{\max}/2$ ("short"), and Bin 2 comprises $l \geq l_{\max}/2$ ("long"). In practical deployments, the unknown true length $l$ is replaced by a contemporaneous or historical prediction $\hat{l}$, obtained from lightweight regression proxies or small models, and the request is binned based on whether $\hat{l} < l_1$ or not.
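A minimal routing helper (hypothetical names, assuming uniform lengths on $[0, l_{\max}]$) makes the binning rule concrete:

```python
def assign_bin(predicted_len, l1):
    """Bin 1 ("short") if the predicted length is below the threshold,
    otherwise bin 2 ("long")."""
    return 1 if predicted_len < l1 else 2

# For Uniform(0, l_max) lengths the equal-mass threshold sits at the median.
l_max = 100.0
l1 = l_max / 2
print([assign_bin(x, l1) for x in (10.0, 49.9, 50.0, 99.0)])  # [1, 1, 2, 2]
```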
3. Binocular Batching Control Policy
The control policy is implemented as two independent sub-queues, each corresponding to one bin. Pending requests are enqueued according to their estimated generation length. When any bin’s sub-queue accumulates $B$ requests, a batch is formed and appended to a central service queue. The server pulls from this central queue in first-come, first-served order over formed batches. The full policy, matching Algorithm 1 in (Guldogan et al., 2024) specialized to $k = 2$, is described as:
```text
Initialize empty queues Q1, Q2; central service queue S = []

Upon arrival of request r:
    estimate execution length hat_l
    if hat_l < l1:
        Q1.enqueue(r)
    else:
        Q2.enqueue(r)
    for Qi in {Q1, Q2}:
        while Qi.size() >= B:
            batch = Qi.dequeue(B)
            S.append(batch)

While the server is free and S is not empty:
    batch = S.pop_front()
    run_model(batch)   # service time = max of the batch's actual lengths
```
Requests are thus always processed in homogeneous-length cohorts, mitigating performance bottlenecking by outlier requests.
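The pseudocode above can be turned into a small runnable sketch; the class and method names here are illustrative, not from the paper:

```python
from collections import deque

class BinocularBatcher:
    """Minimal sketch of the two-bin control policy: requests are routed
    by predicted length, and a batch is emitted whenever a bin fills."""

    def __init__(self, l1, batch_size):
        self.l1 = l1
        self.B = batch_size
        self.bins = {1: deque(), 2: deque()}
        self.service_queue = deque()  # batches ready for the server

    def arrive(self, request, predicted_len):
        q = self.bins[1] if predicted_len < self.l1 else self.bins[2]
        q.append(request)
        if len(q) >= self.B:
            self.service_queue.append([q.popleft() for _ in range(self.B)])

    def next_batch(self):
        """Called when the server is free; None if no batch is ready."""
        return self.service_queue.popleft() if self.service_queue else None

batcher = BinocularBatcher(l1=0.5, batch_size=2)
for i, hat_l in enumerate([0.1, 0.9, 0.8, 0.3]):
    batcher.arrive(f"req{i}", hat_l)
print(batcher.next_batch())  # ['req1', 'req2'] — the two "long" requests
```

Note that the "long" batch forms first here because the two long requests arrive before the second short one; batches leave the central queue in formation order.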
4. Analytical Formulas, Throughput, and Optimality
The expected service time for a batch in binocular batching is the probabilistic mixture of the maxima over the two uniform intervals. Let $0$, $l_1 = l_{\max}/2$, and $l_{\max}$ denote the bin endpoints, and $B$ the batch size:
- For $B$ i.i.d. draws from a uniform distribution on $[0, a]$, the expected maximum is $\mathbb{E}[\max] = \frac{B}{B+1}\,a$.
The key throughput and resource utilization formulas for binocular ($k=2$) batching are:
- Expected service time: $\mathbb{E}[S_2] = \frac{1}{2}\left[\frac{B}{B+1}\cdot\frac{l_{\max}}{2} + \left(\frac{l_{\max}}{2} + \frac{B}{B+1}\cdot\frac{l_{\max}}{2}\right)\right] = \frac{(3B+1)\,l_{\max}}{4(B+1)}$
- Throughput: $\mu_2 = \frac{B}{\mathbb{E}[S_2]} = \frac{4B(B+1)}{(3B+1)\,l_{\max}}$
- Stability condition (resource utilization): $\rho = \frac{\lambda\,\mathbb{E}[S_2]}{B} < 1$
Throughput optimality is established by convexity arguments: $\mathbb{E}[S]$ is convex in the threshold $l_1$ and minimized when both bins carry equal probability mass, producing strictly higher throughput than single-queue ($k=1$) batching; moreover, throughput increases monotonically with the number of bins ($\mu_k$ increases with $k$).
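These closed forms can be checked numerically; this sketch (illustrative function names) evaluates the 1-bin and 2-bin expected service times and the resulting throughput ratio $4B/(3B+1)$:

```python
def expected_service_time(B, l_max, bins=1):
    """E[batch service time] for Uniform(0, l_max) lengths,
    with k = 1 or k = 2 equal-mass bins."""
    if bins == 1:
        return B / (B + 1) * l_max
    # Two bins: each batch is all-short or all-long with equal probability.
    half = l_max / 2
    e_short = B / (B + 1) * half          # max over Uniform(0, l_max/2)
    e_long = half + B / (B + 1) * half    # max over Uniform(l_max/2, l_max)
    return 0.5 * (e_short + e_long)

B, l_max = 16, 1.0
tp1 = B / expected_service_time(B, l_max, bins=1)
tp2 = B / expected_service_time(B, l_max, bins=2)
print(round(tp2 / tp1, 3))  # 4B/(3B+1) = 64/49 ≈ 1.306
```

The ratio approaches $4/3$ as $B \to \infty$, consistent with the roughly 31% empirical gain reported below.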
5. Empirical Results and Performance Gains
Empirical findings in (Guldogan et al., 2024) demonstrate notable throughput improvements:
- In a simulated uniform workload, standard (1-bin) throughput is approximately $6.45$ requests/sec, while the 2-bin (binocular) scheme achieves $8.43$ requests/sec, a gain of nearly $31\%$.
- In LLM-in-the-loop deployments (Phi-3.5-mini-instruct with oracle length knowledge), 2-bin throughput is $1.2\times$ or more that of single-queue batching, and end-to-end experiments likewise report substantial throughput improvements.
Robustness analysis demonstrates that with two bins and a length-prediction error probability up to $0.1$, throughput remains close to that of the oracle (error-free) two-bin scheme, indicating resilience to moderate prediction inaccuracies.
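This robustness can be probed with a simulation under an assumed, simplified error model in which each request is routed to the wrong bin independently with probability $p$ (names and parameters are illustrative, not the paper's exact setup):

```python
import random

def two_bin_throughput(p_err, B=16, n_req=200_000, seed=1):
    """Monte Carlo throughput (requests per unit time) of 2-bin batching
    for Uniform(0, 1) lengths, with each request mis-binned w.p. p_err."""
    rng = random.Random(seed)
    bins = {1: [], 2: []}
    total_service, served = 0.0, 0
    for _ in range(n_req):
        length = rng.random()
        b = 1 if length < 0.5 else 2           # oracle bin assignment
        if rng.random() < p_err:
            b = 3 - b                          # prediction error flips it
        bins[b].append(length)
        if len(bins[b]) >= B:
            total_service += max(bins[b][:B])  # batch runs at its max length
            del bins[b][:B]
            served += B
    return served / total_service

oracle, noisy = two_bin_throughput(0.0), two_bin_throughput(0.1)
one_bin = (16 + 1) / 1.0  # analytic 1-bin throughput (B+1)/l_max
print(noisy < oracle, noisy > one_bin)  # noisy 2-bin still beats 1-bin
```

Under this error model some throughput is lost relative to the oracle, but the noisy 2-bin scheme still outperforms the 1-bin baseline.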
6. Implementation Notes and Practical Trade-offs
Predicting request execution length can leverage simple linear proxies or compact regressors; the predictor need only distinguish “short” from “long” reliably (roughly $90\%$ of the time, matching the $0.1$ error tolerance above) to capture the majority of the throughput gains. For non-uniform length distributions, the bin threshold should correspond to the 50th percentile (median) of predicted lengths.
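One plausible predictor setup, with hypothetical names and coefficients that would in practice be fit on serving logs, pairs a linear proxy with a rolling median threshold:

```python
import statistics

def predict_length(prompt_tokens, slope=0.35, intercept=20.0):
    """Hypothetical affine proxy: predicted decode length as a function of
    prompt length. Coefficients would be fit by regression on recent logs."""
    return slope * prompt_tokens + intercept

def current_threshold(recent_predictions):
    """For non-uniform workloads, place the bin threshold at the median
    (50th percentile) of recently predicted lengths."""
    return statistics.median(recent_predictions)

preds = [predict_length(n) for n in (40, 120, 300, 800, 2000)]
l1 = current_threshold(preds)
print(l1)  # 125.0 — the middle prediction, predict_length(300)
```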
A trade-off of binocular batching is that splitting the arrival stream into two bins halves the arrival rate per bin, slightly increasing the time to fill a batch. Under typical loads, this overhead is small (typically under $10$ ms) and is counterbalanced by the $20$–$31\%$ throughput increase. Production systems should implement “max-wait” timers (e.g., $10$ ms) to bound latency growth during temporary underload. Integration with GPU batch feeders requires maintaining two sub-queues at the front-end and dispatching batch-ready groups promptly to the hardware scheduler.
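The max-wait timer can be sketched as a dispatch predicate evaluated whenever the batcher wakes up; the function name and injected clock are illustrative:

```python
import time
from collections import deque

def maybe_dispatch(queue, batch_size, oldest_arrival, max_wait_s, now=None):
    """Return a full batch if the bin has filled, or a partial batch once
    the oldest queued request has waited max_wait_s; else None."""
    now = time.monotonic() if now is None else now
    if len(queue) >= batch_size:
        return [queue.popleft() for _ in range(batch_size)]
    if queue and now - oldest_arrival >= max_wait_s:
        # Underload: flush the partial batch to bound tail latency.
        return [queue.popleft() for _ in range(len(queue))]
    return None

q = deque(["r0", "r1"])
print(maybe_dispatch(q, 4, oldest_arrival=0.0, max_wait_s=0.010, now=0.005))
# None: batch not full and max-wait not yet exceeded
print(maybe_dispatch(q, 4, oldest_arrival=0.0, max_wait_s=0.010, now=0.020))
# ['r0', 'r1']: timer expired, partial batch is flushed
```

Passing `now` explicitly keeps the predicate testable; in production the default monotonic clock is used.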
7. Summary and Context within Multi-Bin Batching
Binocular batching is a specialization of Multi-Bin Batching to . It partitions incoming inference requests into "short" and "long" bins based on a central threshold, then batches and serves them to maximize throughput. The method provably and empirically reduces expected batch service times by minimizing the per-batch maximum, and consistently achieves significant throughput advantages over naive batching in both simulation and LLM production settings, with robust performance under length prediction errors and minimal practical implementation overhead (Guldogan et al., 2024).